


THE ANNALS 


of 


MATHEMATICAL 
STATISTICS 


NOVEMBER, 1930 


Copyright 1930 


Published and Lithoprinted by 


EDWARDS BROTHERS, Inc. 
ANN ARBOR, MICH 


(Printed in U.S. A.) 








EDITORIAL COMMITTEE 


H. C. Carver, Editor 

B. L. Shook, Assistant Editor 

J. Shohat, Foreign Editor 

J. W. Edwards, Business Manager 


A quarterly publication sponsored by the American Statistical Association, 
devoted to the theory and application of Mathematical Statistics. 


The rates are six dollars per annum


Reprints of any article in this issue may be obtained at any time 
from the Editor at the following rates, postage included. 


Number of copies    Cost per page
1-4                 2 cents
5-24                1% cents
25-49               1 cent
60 and over         % cent


Address: Editor, Annals of Mathematical Statistics
Post Office Box 171, Ann Arbor, Michigan 





CONTENTS OF NO. 4 


Page


The Sampling Variability of Linear and Curvilinear Regressions 275 


Mordecai Ezekiel 


Transformations of Bimodal Distributions 


G. A. Baker 


Error and Unreliability in Seasonals 


Edgar Z. Palmer 


Modifications of the Link Relative and Interpolation Methods of 
Determining Seasonal Variation 


Richard A. Robb 








THE SAMPLING VARIABILITY OF LINEAR AND 
CURVILINEAR REGRESSIONS 


A FIRST APPROXIMATION TO THE RELIABILITY OF THE
RESULTS SECURED BY THE GRAPHIC
“SUCCESSIVE APPROXIMATION” METHOD


By 


MORDECAI EZEKIEL¹


Many statistical problems involve determining the change in one
variable with changes in each of several others, all operating at the
same time. Linear multiple correlation provides a method of making
this determination, on the assumption that all the relations are linear.
In many problems this assumption is not valid. To determine curvilinear
relations without making assumptions as to the type of each
curve except that it be a continuous function, a method of successive
approximations by graphic fitting was presented six years ago; and it
was demonstrated empirically that in cases of high correlation this
method successfully determined the underlying curves.² It was also
pointed out that multiple regression curves could be fitted by the
least-squares method, if specific parabolae or other first-degree equations
were assumed for each variable, following methods previously
suggested by Yule.³


1. Formerly Senior Agricultural Economist, United States Department of Agri- 
culture. 

2. Ezekiel, Mordecai. A Method of Handling Curvilinear Correlation for any
Number of Variables. Quart. Pub., Amer. Stat. Assoc., XIX, No. 148, Dec.,
1924.

3. Yule, G. U. “On the Theory of Correlation,” Jour. Roy. Stat. Soc., Vol. LX,
p. 817 (1897). Apparently Wicksell had also suggested fitting regression curves
to several variables simultaneously. Wicksell, S. D., Annals of Math. Stat.,
Vol. I, No. 1, pp. 3-15. Feb., 1930.



















276 





SAMPLING VARIABILITY OF REGRESSIONS 





The advantage claimed for the successive approximation method 
was that it did not require assumptions as to the specific type of each 
curve, but instead permitted each regression to be indicated by the ob- 
servations themselves. 

A new measure, the “index of multiple correlation,” was suggested 
to measure correlation for curvilinear regressions in the same way 
that the coefficients of multiple correlation measured it for linear re- 
gressions. 


No measure of the reliability of the net regression curves, or of
the index of correlation, was provided in the initial article. The usefulness
of the results secured by this method has therefore been limited
by the inability to state the confidence that could be placed in them
even when based on a random sample, or to judge how large a sample 
would be necessary to infer, within any stated limits of precision and 
probability, the relations existing in the universe from which that 
sample was drawn. 






























This paper reports an attempt to determine the sampling error
of multiple regression curves and indexes of correlation obtained by
the successive approximation process, under conditions of simple
sampling. The experimental method has been used to investigate the
variability of results from successive samples drawn from the same
universe under specified conditions and to establish error formulae
inductively. These experiments, representing the solution of over 150
multiple curvilinear correlation problems, indicate the possibility of
establishing approximate expressions for the reliability of multiple
regression curves and indexes of multiple correlation.¹ The results,
however, are not fully consistent, and the error formulae are not
completely satisfactory. The experimental results are therefore given in
full, in the hope that the attention of mathematicians may be attracted
to this problem, and that the tentative formulae may be modified to
provide more rigorous and exact measures of the reliability of the
curvilinear regressions and correlations.





1. The extensive computations involved in this investigation were carried through
by Helen L. Lee and Della E. Merrick, and by others of the staff of the Division
of Farm Management, U. S. Department of Agriculture. Credit is due them
for their intelligent and loyal assistance.


PART I.—-COEFFICIENTS AND INDEXES OF 
CORRELATION. 


1. THE REDUCTION OF THE “DEGREES OF FREEDOM” BY
FREE-HAND SMOOTHING.


When a line is fitted to a series of paired observations by the use
of the formula $Y = a + bX$, the assumption is made that the
straight regression line is adequate to describe the relation. Two parameters,
one giving position to the line and the other slope, are required.
For that reason, this equation will give a perfect fit to any two pairs
of observations of $X$ and $Y$. Furthermore, if the line is fitted to
four pairs of observations, the determination of two parameters from
four observations reduces the degrees of freedom in obtaining the line
from four to two; and the standard errors of the parameters must
be determined with the number of degrees of freedom, $N$, equal to
2 instead of 4. Similarly, if a cubic parabola $Y = a + bX + cX^2 + dX^3$
were fitted to ten observations, there would be only 6 degrees of freedom
after determining the four parameters, and the standard errors
would be based on $N = 6$. In this case the four parameters determine
position, slope, rate of change, and change in the rate of change.¹
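
The counting argument above can be sketched in a few lines of Python. This is only an illustration: the function names `residual_degrees_of_freedom` and `line_through` and the two sample points are hypothetical, not part of the original paper.

```python
def residual_degrees_of_freedom(n_observations: int, n_parameters: int) -> int:
    """Observations left free to vary once the parameters have been fitted."""
    if n_parameters > n_observations:
        raise ValueError("more parameters than observations")
    return n_observations - n_parameters


def line_through(p, q):
    """Return (a, b) of Y = a + bX passing exactly through two points."""
    (x1, y1), (x2, y2) = p, q
    b = (y2 - y1) / (x2 - x1)   # slope
    a = y1 - b * x1             # position (intercept)
    return a, b


# Straight line (2 parameters) fitted to 4 observations.
print(residual_degrees_of_freedom(4, 2))    # 2
# Cubic parabola (4 parameters) fitted to 10 observations.
print(residual_degrees_of_freedom(10, 4))   # 6

# Any two observations are fitted perfectly by a straight line,
# so no freedom is left from which to judge the error.
a, b = line_through((1.0, 3.0), (4.0, 9.0))
print(a + b * 1.0, a + b * 4.0)             # 3.0 9.0
```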


If instead of fitting a curve by the method of least squares or 
some other exact method, a free-hand curve is drawn by eye through 
the series of observations, it is necessary to make certain assumptions 
in drawing the curve, analogous to those represented in the parameters 
when more rigid methods are used. In addition to the basic assump- 
tion of continuity, these conditions may include: 


(1) Whether the origin for $X = 0$ will be at $Y = 0$ or at some
ordinate to be indicated by the data.


(2) Whether a straight line will be fitted (by ruler or thread) or
whether a curve will be permitted.


1. The treatment of standard errors for small samples by “Student” and R. A.
Fisher, as set forth in the latter’s “Statistical Methods for Research Workers,”
gives full recognition to these facts. Least square theory has always recognized
that, for small samples, the number of parameters determined reduced
the number of observations. See Wright, Thomas Wallace, and John Fillmore
Hayford, “Adjustments of Observations,” 1905, pp. 24-40, 132-133, and Merriman,
Mansfield, “Method of Least Squares,” 1911, pp. 80-82.







(3) If a curve, whether it will be limited (a) to a continuous arc of
even curvature, (b) to a continuous parabola-like curve, (c)
whether one or more inflections will be permitted, (d) whether
the line will be so drawn as to minimize departures on the $Y$-axis,
the $X$-axis, or at right-angles to the line itself.


It is evident that if a curve is drawn free-hand with its initial 
ordinate as indicated by the observations, with a continuous changing 
rate of curvature, and with no inflection, at least the three parameters 
of position, slope, and rate of change of curvature are represented, as 
shown by the corresponding equation for a parabola. 


$$Y = a + bX + cX^2$$


It is true that the free-hand curve may involve still more para- 
meters, but three is the minimum. While the number of parameters 
represented in any free-hand curve cannot be exactly determined, it can 
be roughly estimated by a process of reasoning similar to that indicated 
above; and any measure of the sampling reliability of such free-hand 
curves would be more reliable if it allowed for the number of para- 
meters assumed than if it ignored this reduction of the degrees of 
freedom. 


It should be noted that while the process of fitting curves free- 
hand involves the “taste” of the investigator, represented in the con- 
ditions he places on himself as previously mentioned, and on his skill 
in drawing the line under those conditions, the process of fitting a curve 
by a mathematical formula also involves “taste” in deciding what formula
to use. If the conditions placed on the free-hand fitting are the
same as those represented in the mathematical equation, the results may
agree within the significant limits of error, and, therefore, either may
be satisfactory for practical purposes.¹


When coefficients of correlation or coefficients of multiple correlation
are obtained from samples with a limited number of cases, the
reduction in the number of degrees of freedom by the two or more
parameters in the regression equation makes the observed correlation


1. Note the witty discussion of free-hand versus mathematical curves in the pres- 
idential address by E. B. Wilson, Proceedings: American Statistical Associa- 
tion, March, 1930. 





M. EZEKIEL 279 


tend to exceed the true correlation in the universe from which the
sample was obtained. Accordingly, even the usual linear correlation
coefficients, if obtained from small samples, tend to exceed the true
values. Adjustments to correct for this factor will be considered before
going to the more complicated problem of adjustments in observed indexes
of correlation.


2. BIAS IN COEFFICIENTS OF CORRELATION


Determining a coefficient of correlation from a finite sample reduces
by 2 the number of degrees of freedom present. As a consequence,
there is a tendency for the computed correlation to exceed the
true correlation in the universe, and a corresponding tendency for the
computed standard error of estimate to fall below the true value. Exact
measures of the “most likely” value of the correlation coefficient were
given by Soper and others in 1917¹ and an elaborate method was provided
for estimating it.²


Where a coefficient of multiple correlation for $n_1$ independent
variables is determined from a finite sample of $n'$ independent observations,
the degrees of freedom are reduced by the $n_1 + 1$ parameters
represented in the regression equation. If $n' = n_1 + 1$, the number
of observations exactly equals the number of parameters to be obtained,
the least square solution reduces to a simultaneous solution of the $n'$
observation equations, and the coefficient of multiple correlation comes
out 1.00 regardless of the presence or absence of correlation in the
universe.
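
The simplest instance of this degenerate case is one independent variable and two observations. The sketch below (the helper name `pearson_r` and the data values are hypothetical) shows that the computed correlation is then perfect in absolute value whatever the two observations happen to be, while a third observation restores some freedom.

```python
import math


def pearson_r(xs, ys):
    """Ordinary product-moment correlation of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Two observations, one independent variable: as many observations as
# parameters, so the fit is exact and |r| = 1 for any pair of points.
r_two = pearson_r([0.3, 2.0], [5.1, 4.7])
print(round(abs(r_two), 6))

# With a third observation the correlation is free to fall below 1.
r_three = pearson_r([0.3, 2.0, 1.1], [5.1, 4.7, 6.0])
print(abs(r_three) < 1.0)
```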


R. A. Fisher called attention to this problem in 1924 and suggested
an approximate adjustment of the observed correlation from limited
samples by the equation

$$1 - \bar{R}^2 = \frac{n' - 1}{n' - n_1 - 1}\,(1 - R^2) \tag{1}$$


1. Soper, H. E., Young, A. W., Cave, B. M., Lee, A., and Pearson, K. On the
Distribution of the Correlation Coefficient in Small Samples. A Cooperative
Study. Biometrika. Vol. XI, Part IV, May, 1917, pages 352-359.


2. Loc. cit., pp. 374-375.







where $n'$ and $n_1$ have the same meanings as above, $R$ is the correlation
observed in the sample, and $\bar{R}$ is the most probable correlation in
the universe.¹ This correction is very similar to that deduced independently
by B. B. Smith in 1925, directly from the least square adjustment
for number of constants.² In the same notation as above,
Smith’s adjustment may be stated

$$1 - \bar{R}^2 = \frac{n'}{n' - n_1}\,(1 - R^2) \tag{2}$$

which differs from Fisher’s formula only by the omission of the $-1$
from both numerator and denominator. In restating this formula a year
ago³ the present author modified it to the form

$$\bar{R}^2 = 1 - \frac{1 - R^2}{1 - \dfrac{n_1 + 1}{n'}}$$

or, stated in the same form as (1) and (2),

$$1 - \bar{R}^2 = \frac{n'}{n' - n_1 - 1}\,(1 - R^2) \tag{3}$$


This differs from both the previous equations in including the $-1$
term in the denominator but not in the numerator. The effect is thus
to make the correction most severe; i.e., the corrected value departs
still more from the uncorrected value than in either of the other forms.
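
The relative severity of the three adjustments can be checked numerically. The following is a minimal sketch based on the verbal description of the three forms given in the text; the function names and the illustrative values ($R = 0.60$, $n' = 17$, one independent variable) are assumptions chosen for the example.

```python
def fisher_adjusted_r2(r, n_obs, n_indep):
    """Form (1): -1 in both numerator and denominator."""
    return 1.0 - (n_obs - 1) / (n_obs - n_indep - 1) * (1.0 - r * r)


def smith_adjusted_r2(r, n_obs, n_indep):
    """Form (2): the -1 omitted from both numerator and denominator."""
    return 1.0 - n_obs / (n_obs - n_indep) * (1.0 - r * r)


def ezekiel_adjusted_r2(r, n_obs, n_indep):
    """Form (3): the -1 kept in the denominator only."""
    return 1.0 - n_obs / (n_obs - n_indep - 1) * (1.0 - r * r)


r, n_obs, n_indep = 0.60, 17, 1
for name, adjust in (("(1) Fisher", fisher_adjusted_r2),
                     ("(2) Smith", smith_adjusted_r2),
                     ("(3) Ezekiel", ezekiel_adjusted_r2)):
    print(name, round(adjust(r, n_obs, n_indep), 4))
```

For any sample in which all three forms are defined, form (3) gives the lowest adjusted value and form (2) the highest, with form (1) between them.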


1. Fisher, R. A., The Influence of Rainfall on the Yield of Wheat at Rotham- 
stead—Phil. Trans. B. ccxiii, 89-142; 1924. 


2. Smith, B. B. Forecasting the Acreage of Cotton. Jour. Amer. Stat. Assoc.,
March, 1925. Footnote on p. 41.


3. Ezekiel, Mordecai. Application of the Theory of Error to Multiple and Curvi- 
linear correlation. Jour. Amer. Stat. Assoc., Supp., pp. 99-104, Vol. XXIV, 
No. 165-A. March, 1929. 







The interpretation of correlation coefficients adjusted by any one 
of the three equations (1), (2), or (3) has been difficult because of 
lack of a definite explanation of the meaning of the adjusted coefficients. 
To determine their exact meaning, and to decide which one of the three 
forms of adjustment is most satisfactory, a study has been made of 
the relation of the adjusted values to the distribution of simple corre- 
lation coefficients when computed from random samples of various 
sizes drawn from universes with specified correlations. The “Cooperative
Study” gives tables showing the exact theoretical frequency
curves for zero order correlation coefficients, computed from samples
of from 3 to 25, and 50, 100, and 400 observations, for true correlations
ranging from 0 to .9, by tenths. Ordinates of the distributions of
observed correlations are given for each value from $r = -1.00$ to $+1.00$
by .05 steps. With the frequency curve thus defined by as many as 41
ordinates, a rough integral of the curve was constructed by a cumulative
summary of the ordinates. Then dividing by the total area, the
proportion below any particular value was determined. When $\rho$ (the
true correlation in the universe) $= 0$, the summation was made from
0 in both directions to show the proportion of all samples showing
correlations falling below the particular $r$, either plus or minus. When
$\rho$ exceeds 0, the summation was made from $-1.00$ to increasing values,
to show the proportion of all the samples which show correlations
falling below any particular value.
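
The rough integration by cumulative summation may be illustrated as follows. This is a sketch under stated assumptions, not the procedure actually carried out for the paper: the ordinates are computed here from the known density of $r$ for a true correlation of zero, which is proportional to $(1 - r^2)^{(n'-4)/2}$, instead of being read from the tables of the “Cooperative Study”; trapezoidal sums stand in for whatever summation rule was used; and all function names are hypothetical.

```python
def ordinates_for_zero_correlation(n_obs):
    """Unnormalised ordinates of the distribution of r at r = -1.00,
    -0.95, ..., +1.00 when the true correlation is zero."""
    if n_obs < 4:
        raise ValueError("this sketch assumes at least 4 observations")
    rs = [i / 20 for i in range(-20, 21)]            # 41 ordinates
    ys = [(1.0 - r * r) ** ((n_obs - 4) / 2) for r in rs]
    return rs, ys


def ogive(ys, step=0.05):
    """Cumulative trapezoidal sums of the ordinates, divided by the total."""
    running = [0.0]
    for left, right in zip(ys, ys[1:]):
        running.append(running[-1] + step * (left + right) / 2)
    total = running[-1]
    return [value / total for value in running]


def proportion_beyond(rs, cumulative, limit):
    """Share of samples with r below -limit or above +limit."""
    lower = max(c for r, c in zip(rs, cumulative) if r <= -limit + 1e-9)
    upper = min(c for r, c in zip(rs, cumulative) if r >= limit - 1e-9)
    return lower + (1.0 - upper)


rs, ys = ordinates_for_zero_correlation(9)
share = proportion_beyond(rs, ogive(ys), 0.35)
print(f"{100 * share:.1f} per cent of samples of 9 lie beyond +/-0.35")
```

The share obtained in this way can be set beside the figure of about 35 per cent quoted in the text for samples of 9 from a universe with no correlation.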


For each size sample investigated as described, more than 50 per cent
of the theoretical observed correlations exceeded the true correlation.
Thus for $\rho = .40$, with samples of 4, over 55 per cent of the samples
showed $r$ in excess of .40; 53 per cent with samples of 9; and about 51
per cent with samples of 50. But with $\rho = .80$, over 61 per cent of
the samples showed $r$ above .80 with samples of 4, 56 per cent with
samples of 9, and 53 per cent with samples of 25. If we define the
value which will be exceeded by exactly half the samples as the value
which is most likely to be observed in any given sample, this “most
likely” observed correlation is evidently in excess of the true value.
The problem is to determine the adjustment equation, similar to eq.
(1), (2), or (3), which will reduce the observed value to the correlation
which exists in the universe from which it is most probable that
that sample was drawn.


Frequency ogives (on a percentage basis) were constructed from
the tables in the “Cooperative Study” for $\rho$ = 0, 0.2, 0.4, 0.6, 0.8, and
0.9, for $n'$ = 4, 5, 9, 17, 25, 50, and 100. Equation (1) was then







tested against these ogives, to determine what was the significance of
the adjustment. For zero order correlations, equation (1) becomes

$$1 - \bar{r}^2 = \frac{n' - 1}{n' - 2}\,(1 - r^2)$$

Hence, with $\rho = 0$ and $n' = 9$, $r$ would have to equal at least
$\pm 0.35$ for $\bar{r}$ to be 0. Comparing this value, 0.35, with the frequency
ogive for $\rho = 0$, $n' = 9$, it was found that only 35 per cent of the
samples would give observed correlations larger than 0.35, or smaller
than $-0.35$. Similarly for $\rho = 0.6$ and $n' = 17$, $r$ would have to be
0.63 for $\bar{r}$ to be .60. For these conditions, 49 per cent of the samples
would give observed correlation in excess of .63. Carrying out this
same comparison for all of the ogives constructed gives results as shown
in the following tabulation.


Size of sample ($n'$)        When correlation in sample is


Proportion of samples, of specified sizes, drawn from universes of
specified correlations, which show correlations in excess of the true
value in the universe, even after adjusting the observed correlation
by the formula

$$1 - \bar{r}^2 = \frac{n' - 1}{n' - 2}\,(1 - r^2)$$

These values are determined from the graphs based on a rough 
integration by successive summations, and slight errors may have en- 
tered in making the graphic interpolations. Hence the values cannot 
be regarded as precise. The error probably does not exceed .01 or .02 
in any case, however, so the results are sufficiently exact to interpret 


the general effect of the correction formula. 
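
The two worked figures quoted earlier in this section (an observed $r$ of 0.35 for $n' = 9$ with no true correlation, and 0.63 for $n' = 17$ with a true correlation of 0.6) follow from inverting the zero order form of equation (1). A short check, with the function name `required_observed_r` introduced only for illustration:

```python
import math


def required_observed_r(rho, n_obs, n_indep=1):
    """Observed r whose adjusted value, by form (1), equals rho.

    Solves 1 - rho^2 = (n' - 1) / (n' - n1 - 1) * (1 - r^2) for r.
    """
    r_squared = 1.0 - (n_obs - n_indep - 1) / (n_obs - 1) * (1.0 - rho * rho)
    return math.sqrt(r_squared)


print(round(required_observed_r(0.0, 9), 2))    # 0.35
print(round(required_observed_r(0.6, 17), 2))   # 0.63
```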


It is evident from the table that when the true correlation is high,
.80 or above, the probability of a value as large as that implied by







the use of adjustment formula (1) is practically .50. Tests by the
tables given in the “Cooperative Study” for the most probable value 
show that the probability becomes almost exactly .50 for larger samples 
and still higher correlations, the adjusted values by those tables and 
by the correction formula agreeing to the third or fourth decimal 
place. 


Where the true correlation is low, however, the table indicates that
the adjustment is too severe; that is, the probability of the true correlation
in the universe being as high as the correlation shown after
the adjustment is more than .50, and may be as high as .70 (for $n' = 4$
or 5 and $\rho = 0.2$). Even with this variation in the meaning of the
adjusted value, however, equation (1) gives a valuable adjustment,
since it indicates the probable correlation with almost exactly a .50
probability where the correlation is high, whereas it indicates the probable
correlation with a higher probability (between .50 and .70) for
those cases where the correlation is low and the standard error of the
coefficient is correspondingly large.


Comparison of equations (2) and (3) with the frequency ogives
showed that where $n'$ was small, the adjustment was more severe in
the case of (3), and less severe in the case of (2), and did not in
either case tend to approximate the 50 per cent probability, except
where $n'$ was very large. In some cases equation (2) gives corrected
values so low that such cases are likely to occur more than 50 per cent
of the time, and accordingly the probability would be even less than
.50 that the correlation is really as high as shown by the adjusted
coefficient.


It may be concluded that equation (1) gives the most satisfactory 
simple method for adjusting coefficients of simple or multiple correla- 
tion to remove the positive bias. The adjusted value thus obtained 
may be defined as the value that most probably exists in the true universe,
in the case of a high correlation, or a value slightly below the
probable true value, in the case of a low correlation. 


The adjustment of the standard error of estimate may next be
considered. When a standard deviation, $\sigma_x$, is calculated from the
items in a sample of $n'$ cases, the probable standard deviation of the
items in the universe, $\bar{\sigma}_x$, may be computed (following Fisher) as

$$\bar{\sigma}_x^2 = \frac{n'\,\sigma_x^2}{n' - 1}$$












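So if the standard error of estimate is calculated by the usual
formula

$$S^2 = \sigma_x^2\,(1 - R^2)$$

but the adjusted correlation, $\bar{R}$, is substituted for $R$, and the value
just shown is used for $\sigma_x$, the equation becomes

$$\bar{S}^2 = \frac{n'\,\sigma_x^2}{n' - 1} \cdot \frac{n' - 1}{n' - n_1 - 1}\,(1 - R^2)$$

$$\bar{S}^2 = \frac{n'\,\sigma_x^2}{n' - n_1 - 1}\,(1 - R^2) \tag{4}$$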





This is identical with the equation given by Fisher¹, though in
different form.
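
The reduction of the two-step substitution to a single expression can be verified numerically. The sketch below assumes the squared forms of the formulas (variance of the items and squared standard error of estimate) and uses hypothetical function names and illustrative values; it shows only that the factor $n' - 1$ cancels.

```python
import math


def adjusted_variance(sigma2, n_obs):
    """Probable variance in the universe from the sample variance."""
    return n_obs * sigma2 / (n_obs - 1)


def adjusted_r2(r, n_obs, n_indep):
    """Adjusted squared correlation by form (1)."""
    return 1.0 - (n_obs - 1) / (n_obs - n_indep - 1) * (1.0 - r * r)


def adjusted_se2_two_step(sigma2, r, n_obs, n_indep):
    """Substitute the adjusted variance and adjusted correlation."""
    return adjusted_variance(sigma2, n_obs) * (1.0 - adjusted_r2(r, n_obs, n_indep))


def adjusted_se2_direct(sigma2, r, n_obs, n_indep):
    """Single expression after cancelling the factor n' - 1."""
    return n_obs * sigma2 / (n_obs - n_indep - 1) * (1.0 - r * r)


two_step = adjusted_se2_two_step(4.0, 0.8, 20, 3)
direct = adjusted_se2_direct(4.0, 0.8, 20, 3)
print(math.isclose(two_step, direct))   # True
```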


3. CORRECTING FOR BIAS WITH INDEXES OF
(CURVILINEAR) CORRELATION


Where correlation is measured with respect to curvilinear regres- 
sions, the greater number of parameters represented in the regression 
curve increases the tendency for the observed correlation to exceed 
the actual and requires a more drastic correction of the observed values. 
Where the regression curve is determined by a definite equation, the
number of parameters is known, and the observed correlation may be
adjusted to the most probable true correlation by the use of equation
(1), as before. Since the number of parameters, rather than the
number of independent variables, now becomes of moment, the equation
may be restated for curvilinear correlation

$$1 - \bar{P}^2 = \frac{n' - 1}{n' - m}\,(1 - P^2)$$

using $m$ to designate the number of parameters, and $P$ and $\bar{P}$ to
designate the observed and the adjusted index of correlation. This
formula may be used either for simple or for multiple curvilinear correlation.
Thus if the regression equation

$$X_1 = a + b_2 X_2 + b_2'(X_2^2) + b_3 X_3 + b_3'(X_3^2)$$


1. Fisher, R. A., Statistical Methods for Research Workers. 1928. P. 117, first 
equations; page 135, 2nd equation. 




had been fitted, m would equal 5. For a sample of 20 observations 
and an observed multiple correlation of 0.80, the most probable true 
correlation would be but 0.74. 
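
The worked figure above can be reproduced directly from the adjustment for $m$ parameters. In this sketch the function name `adjusted_index` is hypothetical, and setting a negative adjusted square to zero is an assumption made here for convenience, not a rule stated in the text.

```python
import math


def adjusted_index(observed, n_obs, n_params):
    """Adjusted index of correlation for a curve with n_params parameters."""
    adjusted_square = 1.0 - (n_obs - 1) / (n_obs - n_params) * (1.0 - observed ** 2)
    # Assumption: treat a negative adjusted square as zero correlation.
    return math.sqrt(max(0.0, adjusted_square))


# 20 observations, 5 parameters, observed multiple correlation of 0.80.
print(round(adjusted_index(0.80, 20, 5), 2))   # 0.74
```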


Where the regression curve or curves have been fitted free-hand, 
the observed correlation may be even more in need of adjustment than 
where a definite equation has been employed.¹


It is true that the number of parameters which it would take to 
duplicate the free-hand curve by a definite mathematical function can- 
not be exactly determined without finding some equation which will 
exactly represent the curve. On the other hand, even an approximate 
estimate of the number of parameters which would be required pro- 
vides a better basis for judging the probable true correlation than 
does the observed correlation taken alone. Such an approximate estimate
may be made by considering how many degrees of position,
change, or movement are represented in the graphic curve. The following
list suggests some of these:


(a) Position 

(b) Direction 

(c) Change of direction 

(d) Change in the change of direction 




Where several different free-hand regression curves have been 
obtained by the method of successive approximation, the number of 
parameters represented by each one must be estimated separately. Only 
a single “position” parameter is required, since the origin of each 
regression is purely arbitrary, depending upon the constant in the 
regression equation, and the origin assumed for each of the other 
curves. That is, in the curvilinear regression equation

$$X_1 = a + f(X_2) + f(X_3) + f(X_4)$$

the value of $a$ depends upon the origin used in graphing each of the
functions.


Once the number of parameters represented in the regression 





1. Ezekiel, Mordecai. Application of the Theory of Error to Multiple and Curvilinear
Correlations. Jour. Amer. Stat. Assoc., March, 1929, Supp., pp. 99-104,
Vol. XXIV, No. 165-A.



















equation has been estimated, equation (4) may be used to adjust the
observed correlation. Until more exact information is available, the
explanation of the precise meaning of the adjusted value which has
just been developed for the coefficient of linear correlation, may be
assumed (by analogy) to apply to the adjusted index of (curvilinear)
correlation as well.¹

















4. SAMPLING ACCURACY IN COEFFICIENTS OF CORRELATION 


Although equations (1) and (4) may be used to find the most 
probable correlation in the universe from which a given sample has 
been drawn, they do not give any measure of the range within which 
the true value probably lies, for any specified degree of probability. 





It has long been recognized that coefficients of correlation, com- 
puted from small samples drawn from a universe in which some cor- 
relation exists, show a very skew distribution. Even for samples of 
a size most used in actual research—up to n = 100 or larger—the 
distribution is so skewed that the computed standard error of the cor- 
relation coefficient is of relatively little value. Even with fairly large 
samples the chances of the observed value departing from the true 
value by four or five times its standard error are very much greater 
than any interpretation based upon the normal curve would indicate.²



















Recent investigations by “Student” and by R. A. Fisher have de- 
veloped means of determining the reliability of correlation coefficients 


1. The adjusted correlation corresponding to a given observed correlation, for any
size of sample and value of $m$, may be more readily determined from a graphic
chart, instead of eq. (1) or (4). Such a chart is shown in the appendix to
“Methods of Correlation Analysis,” by the present author, page 404. (John
Wiley and Sons, 1930.)


2. “Student,” On the Probable Error of a Correlation Coefficient. Biometrika,
Vol. VI, p. 302, 1908.
Soper, H. E., On the Probable Error of the Correlation Coefficient to a Second
Approximation. Biometrika, Vol. IX, p. 91, 1913.
Fisher, R. A., Distribution of the Correlation Coefficients of Samples, Biometrika,
10, p. 507, 1915.
Soper, H. E., A. W. Young, B. M. Cave, A. Lee, K. Pearson. Distribution of
Correlation Coefficients in Small Samples. Appendix II, to the papers of “Student”
and R. A. Fisher. Biometrika, XI, pp. 328-413.




while allowing for the skewness of their distribution. That phase of
the subject will not be developed in this article; it is referred to here
merely to call attention to the fact that even after the most probable
value for the true correlation has been determined, it may still be
necessary to take account of how much confidence can be placed in
that value, that is, of how far the correlation obtained from the sample, even
after adjusting as suggested, is likely to vary from the true correlation
of the universe for any stated odds of probability.¹


It must be recognized that the interpretation of the reliability of 

a correlation merely serves to indicate the significance that may be 
attached to the observed correlation, in view of the possibility of varia- 
tion of the observed value from the true value in the universe due 
solely to random variation in sampling. If the conditions under which 
the sample is obtained do not fulfill the assumptions of simple sampling, 
then obviously Fisher’s methods cannot be used unless the necessary 
reservations or modifications are added. 

1. Fisher, R. A. On the “Probable Error” of a Coefficient of Correlation Deduced
from a Small Sample. Metron, 1, No. 4, p. 3, 1921. Statistical Methods for
Research Workers, pp. 159-175, 2nd edition, 1928. The General Sampling
Distribution of the Multiple Correlation Coefficient. Proc. Roy. Soc., A. Vol.
121, pp. 654-673. 1928.
The methods developed by Fisher in the last of these articles have been made
more readily available by the construction of graphic charts, both for simple
and multiple correlations, which are given in the present author’s “Methods of
Correlation Analysis,” pp. 400-403.





PART II—LINEAR AND CURVILINEAR REGRESSIONS 


1. SAMPLING VARIABILITY OF LINEAR REGRESSIONS 


Relatively little attention has been given in practical research work 
to the reliability of the regressions determined. Many correlation 
studies, especially where multiple correlation has been employed, have 
been misinterpreted because proper attention has not been given to 
the standard errors of the regression coefficients. As was pointed out 
recently,¹ this sampling variation may readily be so great in practical
work as to invalidate the conclusions as to the effect of various vari- 
ables, even when samples of considerable size are employed. 


Fortunately, regression coefficients, derived from finite samples
selected by random sampling, tend to be distributed in a normal distribution
in the same way as does the arithmetic mean, so that the elaborate
devices needed to allow for skewed distributions are not required.
If the necessary corrections are made for the failure of the
distribution to be normal when the number of degrees of freedom falls
below 30, the standard error of a linear coefficient of gross regression
or of partial regression may be employed with only the same restrictions
as apply in the case of the arithmetic mean. More recently the
formula for regression errors has been extended by Working, Hotelling,
and Schultz to develop the standard errors of each constant for
curves fitted by least-square methods.²


Where the regression is represented only by a plotted curve in- 
stead of by a definite equation, the reliability of the curve has been 
unknown. Obviously, it cannot be estimated from the constants rep- 
resented in the curve, for they are unknown, and only their number 


1. Ezekiel, Mordecai. The Application of the Theory of Error to Multiple and
Curvilinear Correlations. Jour. Amer. Stat. Assoc. Proceedings, 19th annual
meeting, Vol. XXIV, No. 165-A, pp. 99-104, March, 1929.


2. Working, Holbrook, and Hotelling, Harold. Applications of the Theory of 
Error to the Interpretation of Trends. Jour. Amer. Stat. Assoc. Proc., Vol. 
XXIV, 165-A, pp. 73-85, March, 1929. 

Schultz, Henry. Discussion of above paper. pp. 86-88. 
Schultz, Henry. The Standard Error of a Forecast from a Curve. Jour. Amer. 
Stat. Assoc., June, 1930. 







may be roughly estimated. Some knowledge of the variability of such 
regression curves may, however, be obtained experimentally. 


2. OUTLINE AND SUMMARY OF EXPERIMENTAL STUDY OF SAMPLING 
VARIABILITY OF MULTIPLE CURVILINEAR 
CORRELATION RESULTS 


The study was conducted by first constructing a set of data in
which a dependent variable, $X_1$, was related to several independent
variables according to known curvilinear regressions, and in which a
certain known portion of the variance of $X_1$ was not related to any
of the independent variables. A second universe was then constructed
with the same underlying functions, but with a different proportion 
of random variation in the dependent variable. Successive samples of 
various sizes were drawn at random from both “universes” and net 
(partial) regression curves and indexes of multiple correlation were 
computed separately for each sample. The net regression curves ob- 
tained in successive samples of the same size were compared with the 
true curves and with each other to see how far the results determined 
from the samples differed from the true values, and how much vari- 
ance there was among them. The variability of the curves, for samples 
of different size, different true correlations, and different points along 
the curves, was then studied, and it was found possible to construct 
an error formula to estimate the standard error of the regression 
curves from the values obtained in the individual samples. Checking 
this formula by applying it to each of the samples previously deter- 
mined, the actual errors were found to be in fair agreement with the 
estimated errors. 
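
The general plan of such a sampling experiment can be sketched in miniature. The example below is not the procedure of this study: it uses a single independent variable, a hypothetical true curve, and a least-squares straight-line slope as a stand-in for the graphic successive approximation fit, which cannot be reproduced in code. All names and numerical settings are assumptions. It shows only how repeated samples from a constructed universe reveal the variability of a fitted regression and its dependence on sample size.

```python
import random
import statistics


def true_curve(x):
    """Hypothetical known regression built into the synthetic universe."""
    return 10.0 + 4.0 * x - 0.3 * x * x


def build_universe(size, noise_sd, rng):
    """Pairs (X, Y) in which Y follows the true curve plus random variation."""
    universe = []
    for _ in range(size):
        x = rng.uniform(0.0, 10.0)
        universe.append((x, true_curve(x) + rng.gauss(0.0, noise_sd)))
    return universe


def fitted_slope(sample):
    """Least-squares slope of Y on X (stand-in for a free-hand fit)."""
    xs = [x for x, _ in sample]
    ys = [y for _, y in sample]
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in sample)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def slope_spread(universe, sample_size, n_samples, rng):
    """Standard deviation of the fitted slope over successive random samples."""
    slopes = [fitted_slope(rng.sample(universe, sample_size))
              for _ in range(n_samples)]
    return statistics.stdev(slopes)


rng = random.Random(1930)   # fixed seed so the run is repeatable
universe = build_universe(2000, noise_sd=3.0, rng=rng)
small = slope_spread(universe, 10, 300, rng)
large = slope_spread(universe, 40, 300, rng)
print(small > large)        # smaller samples give more variable regressions
```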


For a more rigorous test of the new error formula for regression 
curves, two new synthetic universes were constructed. Samples of vari- 
ous sizes were drawn from them, net regression curves computed sep- 
arately for each sample, and the actual departures of the computed 
curves from the true curves checked against the error indicated by 
the new formula. The agreement in this test was not so good as in
the previous case. Of the ordinates of the curves, 66.5 per cent
showed errors no greater than their computed standard errors, but only
20.3 per cent fell between 1 and 2 times the computed values, while
7.5 per cent fell between 2 and 3 times, as compared to 68.3, 27.2 and
4.3 per cent, the proportions to be expected if the distribution were normal.













On the other hand, 5.8 per cent of the ordinates had errors exceeding 
3 times the computed standard error, and some departures in excess of 
5 times the computed standard error were obtained. It is evident from 
these results that either (a) the tentative formula is not adequate to 
estimate the standard errors of regression curves determined by the 
free-hand method, or (b) that net regression curves obtained by the 
successive approximation process are so unstable that their errors can- 
not be represented by a normal curve, and possibly may be impossible 
of estimation by any mathematical process. In the hope that the atten- 
tion of others may be drawn to this problem, and a more satisfactory 
error formula be obtained, the experimental study is given subsequently 
in as full detail as possible. 
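The normal proportions quoted in this comparison can be checked with a short calculation. The following is only a sketch; the function and variable names are illustrative and do not come from the paper, and the observed percentages are copied from the text.

```python
import math

def normal_band_shares():
    """Share of a normal distribution with |z| in [0,1), [1,2), [2,3), [3,inf)."""
    def within(k):
        # P(|z| < k) for a standard normal variable
        return math.erf(k / math.sqrt(2.0))
    return {
        "0-1": within(1),
        "1-2": within(2) - within(1),
        "2-3": within(3) - within(2),
        "3+": 1.0 - within(3),
    }

# Observed shares of ordinates (per cent), as reported in the text.
observed = {"0-1": 66.5, "1-2": 20.3, "2-3": 7.5, "3+": 5.8}
expected = {band: 100.0 * share for band, share in normal_band_shares().items()}

for band in observed:
    print(f"{band}: observed {observed[band]:.1f}, normal {expected[band]:.1f}")
```

The excess beyond 3 times the computed standard error (5.8 per cent observed against roughly 0.3 per cent for a normal curve) is the discrepancy the text draws attention to.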


The indexes of multiple correlation obtained from successive
samples of the same size were studied with respect to (1) bias and
(2) variability. As has been previously reported,¹ the indexes of
multiple correlation show an average positive bias even larger than that
of coefficients of multiple correlation. Indexes of multiple correlation
apparently require a correction which takes into account both the
number of observations and the estimated number of constants represented
in the regression curves, according to equation (4) already discussed.
Further study of the variability of the correlations showed that as far 
as could be judged from the relatively small number of replications of 
each size sample (5 to 16) they tend to have a standard error of the 
order of 


$$\sigma_{\rho} = \frac{1 - \rho^{2}}{\sqrt{n' - m}}$$

where $n'$ and $m$ have the same meaning as for equation (4), and
where $\rho$ represents the observed index of multiple correlation. If
this very rough approximation for their sampling errors is found ade- 
quate, it would seem logical to expect Fisher’s determination for the 
sampling error of multiple correlation coefficients to apply equally well 
to indexes of multiple correlation. 


In concluding this summary, it must be reiterated that these con-
clusions are only tentative. They provide at least some indication
of the reliability of curvilinear correlation results, for which previously


1. Loc. Cit., Proc. Amer. Stat. Assoc., March 1929, p. 100. 







nothing had been known. The error formulae are only first approx- 
imations, however, and in the case of the error of net regression curves, 
are such a poor approximation that much more work remains to be 
done before the results of such analyses can be used with anything like 
the degree of confidence that can be felt in older and more well-estab- 
lished statistical procedures. 




DETAILS OF EXPERIMENTAL STUDY 


3. CONSTRUCTION OF SYNTHETIC UNIVERSES


The set of data used in the initial sampling was constructed as
follows:


1. Values for $X_2$ were obtained by taking the sum of values from
two dice. The throws were repeated 500 times, giving 500 values.


2. To insure some curvilinear correlation between $X_2$ and $X_3$,
values of $X_3'$ were computed for each value of $X_2$, according to the
following function.


[Table: value of $X_3'$ for each value of $X_2$; the entries are not legible in this copy.]


One die was then thrown, and the value for $X_3$ computed as the
die reading $+ X_3'$ [$X_3 =$ die reading $+ f(X_2)$].

3. Values for $X_4'$ were then computed for each value of $X_2$
according to the following function:







Again, one die was thrown, and the reading of the die added to
the $X_4'$ value to get $X_4$. This gave a set of 500 values of $X_2$, $X_3$,
and $X_4$, fairly normally distributed, with positive correlation between
$X_2$ and $X_3$ ($r = +.534$); with a negative correlation between $X_2$
and $X_4$ ($r = -.489$); and between $X_3$ and $X_4$ ($r = -.234$); and
with all of the inter-correlations more or less curvilinear.


4. Values for a dependent variable, $X_1$, were then calculated
according to the relation


$$X_1 = f(X_2) + f(X_3) + f(X_4) + e,$$


where the values for each of the functions were read from the assumed 
regression curves tabled below, and where e was obtained by throwing 
two dice, and taking the sum of the readings. 
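The construction just described can be imitated with simulated dice. The sketch below is not the author's data: the tabled functions are not reproduced here, so `x3_prime`, `x4_prime`, `f2`, `f3` and `f4` are hypothetical stand-ins chosen only to give curvilinear relations, and it is assumed that both auxiliary functions are read from $X_2$.

```python
import math
import random

# HYPOTHETICAL stand-ins for the tabled functions of the paper.
def x3_prime(x2):
    """Assumed curvilinear function giving X3' (4 to 7) from X2 (2 to 12)."""
    return 4 + round(math.sqrt(x2 - 2))

def x4_prime(x2):
    """Assumed decreasing step function giving X4' (1 to 3) from X2 (2 to 12)."""
    return 3 - (x2 - 2) // 4

# HYPOTHETICAL "true" net regression curves.
def f2(x):
    return (x - 7) ** 2 / 5.0

def f3(x):
    return 0.8 * x

def f4(x):
    return 12.0 / x

def build_universes(n=500, seed=0):
    """Return n observations carrying both dependent variables, X1 and Y."""
    rng = random.Random(seed)

    def die():
        return rng.randint(1, 6)

    rows = []
    for _ in range(n):
        x2 = die() + die()              # step 1: sum of two dice
        x3 = die() + x3_prime(x2)       # step 2: one die plus X3'
        x4 = die() + x4_prime(x2)       # step 3: one die plus X4'
        signal = f2(x2) + f3(x3) + f4(x4)
        x1 = signal + die() + die()     # step 4: e is the sum of two dice
        y = signal + die()              # second universe: e is a single die
        rows.append({"X1": x1, "Y": y, "X2": x2, "X3": x3, "X4": x4})
    return rows

universe = build_universes()
print(universe[0])
```

Because the random element in `Y` comes from one die rather than two, the same independent variables carry a higher true correlation with `Y` than with `X1`.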





VALUES FOR ASSUMED REGRESSION CURVES


[Table: ordinates of the assumed regression curves $f(X_2)$, $f(X_3)$ and $f(X_4)$; the entries are not legible in this copy.]


Values for a second dependent variable, Y , were obtained by using 


the same assumed regressions, but obtaining the value for e by throw- 
ing a single die, rather than two dice. This gave two sets of 500 
observations, both identical as to the independent variables, but with 
different dependent variables, and with the true correlation higher in 
one universe than in the other, since the dependent variable included 
a smaller proportion of random variation in one case than in the other. 
The complete set of 500 paired observations is shown in Table A.
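The difference in random variation between the two universes follows from the variance of the dice themselves. The sketch below computes the exact variance of a single die and of the sum of two dice; these are theoretical values for ideal dice, not the values observed in the 500 actual throws, and the helper name `variance` is illustrative.

```python
from fractions import Fraction
from itertools import product

def variance(values):
    """Population variance of equally likely outcomes, as an exact fraction."""
    n = len(values)
    mean = Fraction(sum(values), n)
    return sum((v - mean) ** 2 for v in values) / n

one_die = variance(list(range(1, 7)))
two_dice = variance([a + b for a, b in product(range(1, 7), repeat=2)])

print(one_die, two_dice)                 # variances of e in the two universes
print(float(one_die / two_dice) ** 0.5)  # ratio of the standard deviations
```

For ideal dice the random element in the second universe has half the variance of that in the first, so its standard deviation is about 0.71 as large.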


4. DRAWING RANDOM SAMPLES


Thirty-one separate samples were drawn from each of the 2 “universes”:
5 samples of 100 observations each; 10 samples of 50 observa-
tions; and 16 samples of 30 observations. In making the drawings,
slips numbered from 1 to 500 were mixed in a box, and drawn at
random. They were stirred afresh between drawings. In making
the drawings for the $X_1$ universe, the slips were not returned to the
box until each sample was completed, so that the same set of data would
appear only once in each sample. In making the drawings for the $Y$
universe, each slip was returned to the box as soon as its number was
noted. In a few cases this resulted in the same observations appearing
twice in the same sample. While 500 is not an “infinite” universe as





















compared to a sample of 100, the difference in the method of drawing 
appeared to make no practical difference in the variability in the two 
sets of samples. However, the fact that the samples made an appre- 
ciable proportion of the “universe” would mean that the variability in 
the observed results would not be quite as large as if drawn from an 
infinite universe. Using Bowley's statement of this effect,¹ however,
the maximum effect, which would be for the samples of 100, would make
the $\sigma$ of the observed deviations about one-tenth smaller than it would
have been if determined by drawings from an infinite universe of sim-
ilar characteristics.





For, following Bowley,

$$\sigma_{n}' = \sigma_{n} \sqrt{1 - n_1/n_2}$$

where

$\sigma_{n}'$ = $\sigma$ of actual sample, from a finite universe
$\sigma_{n}$ = $\sigma$ of a similar sample, from an infinite universe
$n_1$ = number of cases in sample
$n_2$ = number of cases in the finite universe

Hence where $n_2 = 500$, $n_1 = 100$, then $\sigma_{n}' = .8940\,\sigma_{n}$.





Since the effect of the limited universe on the variation in the 
results can thus be estimated, the results can be transformed to what 
they would probably have been had a much larger universe been avail- 
able for study. 
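The two methods of drawing, and the size of the finite-universe effect, can be sketched as follows. The function names and the seed are illustrative assumptions; the correction factor is the square root of one minus the sampling fraction, as in the relation attributed to Bowley.

```python
import math
import random

def draw_sample(universe_size, sample_size, rng, replace):
    """Return slip numbers (1..universe_size) drawn with or without replacement."""
    if replace:
        return [rng.randint(1, universe_size) for _ in range(sample_size)]
    return rng.sample(range(1, universe_size + 1), sample_size)

def finite_universe_factor(n_sample, n_universe):
    """Ratio of the sigma from a finite universe to that from an infinite one."""
    return math.sqrt(1.0 - n_sample / n_universe)

rng = random.Random(1930)              # arbitrary seed, for reproducibility only
plan = [(5, 100), (10, 50), (16, 30)]  # (number of samples, observations in each)

# First universe: slips held out until the sample is complete.
samples_x = [draw_sample(500, size, rng, replace=False)
             for count, size in plan for _ in range(count)]
# Second universe: each slip returned as soon as its number is noted.
samples_y = [draw_sample(500, size, rng, replace=True)
             for count, size in plan for _ in range(count)]

for _, size in plan:
    print(size, round(finite_universe_factor(size, 500), 4))
```

For samples of 100 the factor is the square root of 0.8, about 0.894, which agrees to three figures with the value given in the text; for samples of 30 and 50 the reduction is smaller.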


5. CURVILINEAR REGRESSIONS DETERMINED FROM THE SAMPLES 





Net regression curves were determined for each sample by the 
method of successive graphic approximations, and indexes of multiple 
correlation were computed for each set of curves. Each sample was 
carried through successive approximations until no further significant 
increase in correlation was found by further modifications of the curves. 
From 2 to 4 approximations were necessary, in various cases. The
multiple correlations found for each sample at the first (linear) solution,





1. Bulletin Int. Institute Statistics, Proceedings, Rome, 1925. Annex by A. L. 
Bowley, Cambridge Univ. Press. 




and for each successive set of curves, are shown in Table B. For the
$Y$ universe, a multiple correlation was run to adjust, by least squares,
the slope of each regression curve according to the formula


$$Y = a + b_2\,[f(X_2)] + b_3\,[f(X_3)] + b_4\,[f(X_4)]$$


The indexes of multiple correlation (necessarily higher than the 
previous indexes) as found by this process are also shown in Table B. 
The further study of the sampling variability of the regression curves
was based on the set of regression curves for each individual sample 
which showed the highest correlation for that sample. 


6. ERRORS IN REGRESSION CURVES FROM THE SAMPLES


The net regression curves determined from each successive sample 
were all put on a comparable basis by adding a constant to each so that 
the central ordinate of each would equal the central ordinate of the 
corresponding true regression curve. The differences between the ad- 
justed ordinates at other points along the curves and the true ordin- 
ates would then show the errors in the curves. That is, the difference 
between ordinates at the central value and the ordinates at other points 
along the curve, as shown for the curves determined from the samples, 
were compared with the same differences for the true curves. 
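The adjustment described above amounts to shifting each sample curve so that it agrees with the true curve at the central ordinate, and then differencing. A minimal sketch follows; the curve values are invented for illustration and are not taken from Table C.

```python
def centered_errors(sample_curve, true_curve, center):
    """Errors of a sample curve after matching the true curve at `center`.

    Both curves are dicts mapping an ordinate position to the curve height.
    """
    shift = true_curve[center] - sample_curve[center]
    return {x: (sample_curve[x] + shift) - true_curve[x] for x in sorted(true_curve)}

# HYPOTHETICAL curves, for illustration only.
true_curve = {3: 4.0, 5: 2.5, 7: 2.0, 9: 2.5, 11: 4.0}
sample_curve = {3: 5.1, 5: 3.2, 7: 3.0, 9: 3.9, 11: 5.6}

errors = centered_errors(sample_curve, true_curve, center=7)
for x, err in errors.items():
    print(x, round(err, 2))
```

The error at the central ordinate is zero by construction, so only the shape and slope of the curve contribute to the remaining errors.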


This procedure centered attention on the reliability of the slope 
and shape of the curves, rather than on the accuracy of their position. 
It is true that in linear correlation, the $a$ as well as the $b$ of the for-
mula $Y = a + bX$, is subject to sampling errors, and formulae have
been devised to compute its standard error. In the present case, how- 
ever, it seemed desirable to first solve the problem of the shape and 
slope of the curve, before attacking the further problem of its position. 


The departures of the curves found in the several samples from
the true values for each curve are shown in Table C, for selected or-
dinates. The central point of reference (and therefore the point of
0 error) was taken at approximately the mean value of each independent
variable.


The individual samples were studied to see if there was any rela- 
tion between the correlation observed in individual samples and the 
1. See pages 445-447, Dec. 1924, Jour. Amer. Stat. Assoc., for the original dis- 
cussion of this process. 







































errors in the regression curves. No relation whatever was found 
between the size of the correlation in the individual sample and the 
size of the errors for the sample so long as samples of the same size 
and drawn from the same universe were compared. 


Standard errors for the linear partial regression coefficients were
computed for each sample by the standard formula given by Yule, and,
modified, by R. A. Fisher:


$$\sigma^2_{b_{12.34}} = \frac{\sigma_1^2 \left(1 - R^2_{1.234}\right)}{n' \sigma_2^2 \left(1 - R^2_{2.34}\right)}$$
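As a sketch of how such a standard error would be evaluated, the function below takes the usual form of the formula: the unexplained variance of the dependent variable divided by the number of observations times the variance of the independent variable not accounted for by the other independent variables. The argument names are illustrative, and the treatment of the number of observations (`n_prime`) is an assumption.

```python
import math

def se_partial_regression(sigma_1, r2_dependent, sigma_2, r2_independent, n_prime):
    """Standard error of the net regression coefficient b_12.34.

    sigma_1        standard deviation of the dependent variable
    r2_dependent   squared multiple correlation R^2_1.234
    sigma_2        standard deviation of the independent variable X2
    r2_independent squared multiple correlation R^2_2.34
    n_prime        number of observations (adjusted, if desired)
    """
    variance = (sigma_1 ** 2 * (1.0 - r2_dependent)) / (
        n_prime * sigma_2 ** 2 * (1.0 - r2_independent)
    )
    return math.sqrt(variance)

# HYPOTHETICAL inputs, chosen so the arithmetic is easy to follow.
print(se_partial_regression(2.0, 0.75, 1.0, 0.0, 100))
```

With these inputs the unexplained variance is 1.0, and dividing by 100 observations gives a standard error of about 0.1.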


When the actual errors in the regression curves for individual 
samples were compared with these standard errors, again no relation 
was found for samples of the same size and drawn from the same 
universe. For that reason it was decided to abandon further study of
the characteristics of individual samples, and instead study the charac-
teristics of each entire set of samples of the same size and from the
same universe. 


7. DERIVATION OF TENTATIVE ERROR FORMULA 


Study of the errors showed that, so far as could be judged from 
the limited number of observations, they had a marked tendency to a 
normal distribution. However, to prevent undue weighting of single 
extreme cases, the average deviation was used instead of the standard 
deviation as a basis for summarizing the results shown by different 


samples of the several sizes. These average deviations are shown in 
Table 1 (page 298). 


Each of these results would be expected. The true standard error 
of estimate for Universe X is 2.39, and for Universe Y is 1.80, or 
75.3 per cent as large. It would therefore be reasonable to expect that, 
other things being the same, the errors in the ordinates of the regres- 
sions for Universe Y would average only three-quarters as large as 
the corresponding errors for Universe X . Stating each mean error 
(Table 1) in Universe Y as a percentage of the corresponding mean 
error in Universe X , and taking the geometric mean of these per- 
centages, it appears that on the average the errors in Universe Y are 
78.5 per cent as large as in Universe X , or in fair agreement with the 





















proportion expected. The extent to which the average errors shown in
Table 1 for the selected ordinates in Universe X are correlated with
the average errors for the corresponding ordinates in Universe Y is
shown graphically in Figure 1. It is evident that the individual group
averages agree fairly well with the expected relation. Accordingly,
it was concluded that any formula for the standard error of net regres-
sion curves would have, for one component, $S_{1.234}$, the standard error
of estimate for the dependent variable, just as does the formula for
the probable error of a linear net regression coefficient, which is

$$\sigma_{b_{12.34}} = \frac{S_{1.234}}{\sqrt{n' \sigma_2^2 \left(1 - R^2_{2.34}\right)}}$$


TABLE 1. 


Average deviation of errors in net regression curves, at selected 
ordinates for various sizes of sample. 


Group | Universe X: 16 samples of 30 | 10 samples of 50 | 5 samples of 100 | Universe Y: 16 samples of 30 | 10 samples of 50 | 5 samples of 100

[The entries of this table are not legible in this copy.]













It is evident from Table 1 and from Figure A, which shows the
data graphically for Universe X, (a) that in general the larger the
sample the smaller the average error; (b) that the further from the
center ordinate, the larger the error; and (c) since the errors in Uni-
verse X were usually larger than in Universe Y, that the lower the
true correlation, the larger the error.


The influence of sample size may next be considered. The num- 
ber of observations is involved in two ways in the results shown in 
Table 1. In the first place, the average error tends to vary somewhat 
inversely with the size of sample. But in addition, it tends to vary 
with the distance from the central ordinate. Since the independent 
variables were composed of elements derived from dice readings, their 
distribution was roughly normal. As a result, the number of observa- 
tions upon which the regression curves were based was largest toward 
the center portions, and thinned out toward the extremes. In the graphic 
approximation method of determining the curves, each portion of the 
curve is determined from the cases falling within that portion, rather 
than from all the cases as a whole. Accordingly, it seems logical to 
try to relate the observed differences in the average deviation of the 
errors to differences in the number of cases from which they were 
determined, rather than to the total size of sample. 


There is no precise range within which the observations can be
said to be considered in free-hand fitting. Instead of trying to meas-
ure the exact number of cases within any specified range, therefore, it
seemed desirable to establish a measure of the concentration of observa-
tions at any point along the curve. Thus, for example, if within a
given interval of $X$, with a group interval of $u_x$ units, there are $n_u$
observations, we can express the concentration of observations at the
mid-point of that group by the relation


$$n_x = n_u \left(\frac{\sigma_x}{u_x}\right)$$


If the group-interval is taken equal to the standard-deviation of
the variable, $n_x$ will be simply the number of cases falling within
that group. If, however, the group-interval is made either larger or
smaller than the standard-deviation, this equation will measure the con-
centration of observations in terms of the number per standard-devia-
tion range. In a rectangular distribution, changing the value of $u_x$







would change the size of $n_u$ to a corresponding extent, so the value
of $n_x$ would be independent of the group-interval selected. In a
normal distribution, however, $n_x$ would be only an approximation of
the true value which would be secured from the theoretical distribution
when the total number of cases was made very large and $u_x$ was
made infinitely small.
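The concentration measure can be written as a one-line function. In the sketch below the group counts are hypothetical; the standard deviation 2.455 is the value the paper reports for the first independent variable, and the function name is illustrative.

```python
import math

def concentration(n_u, sigma_x, u_x=1.0):
    """Observations per standard-deviation range: n_x = n_u * sigma_x / u_x."""
    return n_u * sigma_x / u_x

# HYPOTHETICAL count of 12 cases in a group one unit wide.
n_x = concentration(12, 2.455, u_x=1.0)
print(round(n_x, 2))

# In a rectangular distribution, doubling the interval doubles the count,
# and the concentration is unchanged.
n_x_wide = concentration(24, 2.455, u_x=2.0)
print(math.isclose(n_x, n_x_wide))
```

The second comparison illustrates the remark that, for a rectangular distribution, the measure does not depend on the group-interval chosen.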


On the basis of the foregoing reasoning, it was thought that the
differences in the average deviations within each universe as shown
in Table 1, might be explained by differences in the number of cases
which each portion of each curve was based upon. In sampling theory
the dispersion of values of a constant determined from successive
samples ordinarily varies with $1/\sqrt{n}$, rather than with $1/n$; hence, in
this case, it was tentatively assumed that the value $1/\sqrt{n_x}$ would be a
component of the formula for the error of ordinates of regression
curves. This hypothesis was tested by adjusting the averages shown
in Table 1 by multiplying each of these by the factor $\sqrt{n_x}$, determin-
ing the $n_x$ in each case from the true distribution of that variable in
the whole universe, and from the total number of cases in the samples.
These average differences would presumably reflect the true distribu-
tion of each independent variable in the original universe, since the
variations in distribution in different samples would tend to cancel out.
We may therefore use the distribution of the entire universe to in-
dicate the average distribution within samples of specified sizes drawn
from that universe. The calculation of $n_x$ for each ordinate in ac-
cordance with this method is shown in Table 2.
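The adjustment can be sketched in two steps: scale the universe frequency of a group down to the sample size, convert it to a concentration per standard-deviation range, and multiply the average deviation by the square root of that concentration. The names and numerical inputs below are illustrative assumptions, not entries from Tables 1 or 2.

```python
import math

def expected_n_x(universe_count, universe_size, sample_size, sigma_x, u_x=1.0):
    """Average concentration in a sample, inferred from the universe frequency."""
    n_u = universe_count * sample_size / universe_size
    return n_u * sigma_x / u_x

def adjusted_error(average_deviation, n_x):
    """Average deviation restated per unit observation per sigma range."""
    return average_deviation * math.sqrt(n_x)

# HYPOTHETICAL inputs: a group holding 50 of the 500 cases, samples of 100,
# a standard deviation of 2.0, and an average deviation of 0.5.
n_x = expected_n_x(50, 500, 100, sigma_x=2.0)
print(n_x, adjusted_error(0.5, n_x))
```

If the errors do vary inversely with the square root of the concentration, the adjusted values should be roughly the same for samples of every size.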







TABLE 2 


Calculation of n, values for selected ordinates and 
various sizes of samples. 


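Group | Number of cases ($n_u$) in entire universe | Value of $n_x$¹

$X_2$: 3, 5, 9, 11
$X_3$: 5, 7, 11, 13
$X_4$: 2, 3, 4, 6, 7, 8, 9

[The numerical entries of this table are not legible in this copy.]


1. Computed from formula $n_x = n_u\left(\sigma_x / u_x\right)$, with $\sigma_2 = 2.455$;
$\sigma_3 = 2.1644$; $\sigma_4 = 2.098$; $u_x = 1$, since the frequencies for 3 include 2.5
to 3.5; for 5, 4.5 to 5.5, etc.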







TABLE 3 


Average deviation of errors in net regression curves, at selected ordin- 
ates, adjusted to error per unit observation per standard-deviation range 


Group | Universe X | Universe Y

[The layout of this table is not recoverable in this copy. The legible entries, in the order printed, are: 3.60, 2.65, 0.00, 2.01, 3.50; 1.78, 2.53, 0.00, 2.10, 2.06; 1.91, 1.61, 1.10, 0.95, 0.00, 1.89, 2.71, 3.21, 2.93, 1.92.]


When the values in Table 1 are multiplied by the square roots of the
corresponding $n_x$ values, from Table 2, the adjusted values shown in
Table 3 are obtained. Averaging together all the values in Table 3, average adjusted
errors of 1.89 are secured for samples of 30 cases, 2.04 for samples
of 50, and 1.98 for samples of 100 cases. It is evident that most of
the difference due to different sizes of samples has been eliminated.
However, even after this adjustment, the errors tend to increase as
the ordinate departs from the assumed point of origin at the center.
This same relation holds for linear regression lines. The standard
error of any point on a regression line (in relation to the origin at
$M_x = 0$) is $\sigma_b x$, and hence increases directly as $x$ increases. A line


1. Number of observations in each of the successive samples. 







continues out with the slope given it by $b$, and any error in $b$ has a
progressive influence on the accuracy of the line. The free-hand curve,
on the contrary, is more flexible, and does not continue in any deter-
minate direction. Hence it would hardly be supposed that the errors
in the ordinates of the curve would increase with increasing values of
$X$ so rapidly as does the standard error of the straight line. The
errors shown in Table 3 may be tested with respect to this hypothesis 
by averaging, for each universe, the errors shown by the three sizes 
of samples for the several selected ordinates and relating the resulting 
averages to the departures from the assumed means. To put these de- 
partures in comparable terms for the three variables, they may be stated 
in terms of standard deviation units. Carrying these operations 
through, the data appear as shown in Table 4. 


TABLE 4 


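Average adjusted deviation of errors at selected ordinates, contrasted
with departure from origin


Variable | Ordinate | Departure from origin (units)
$X_2$ | 3, 5, 7, 9, 11 | 4, 2, 0, 2, 4
$X_3$ | 5, 7, 9, 11, 13 | 4, 2, 0, 2, 4
$X_4$ | 2, 3, 4, 5, 6, 7, 8, 9 | 3, 2, 1, 0, 1, 2, 3, 4

[The columns of average adjusted errors and of $\sqrt{x/\sigma_x}$ are not legible in this copy.]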







It is evident from Table 4 that the average error, adjusted for
size of sample, increased as the departure from the origin increased.
This is shown more clearly in Figure 2, where the average error is
plotted against the departures from the origin. This figure, however,
indicates that the relation is not linear, as the errors do not increase
in proportion. When the average errors are plotted against the de-
partures on semi-log paper, however, as shown in Figure 3, the rela-
tion is substantially linear, and is of such an order as to suggest that
the errors vary with the square-root of the departures, rather than the
departures themselves. The line drawn in on each chart, with such
a slope as to coincide with the square roots, parallels the relation fairly
well, so from this it may be concluded that another constituent of the
error formula will be


$$\sqrt{\frac{\text{Units departure from origin}}{\sigma_x}}$$


If the origin is made at the mean of $X$, the independent factor,
$X_2$, $X_3$, etc., this segment of the error formula may be stated (using
$x = X - M_x$)


$$\sqrt{\frac{x}{\sigma_x}}$$
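This component is easily tabulated for any set of ordinates. The sketch below uses the standard deviation 2.455 reported for the first independent variable and assumes, for illustration, an origin at 7; the absolute value is taken so that ordinates below the mean are treated like those above it, which is an assumption about how the departures were measured.

```python
import math

def departure_factor(x_value, origin, sigma_x):
    """Square root of the departure from the origin, in sigma units."""
    return math.sqrt(abs(x_value - origin) / sigma_x)

sigma_2 = 2.455   # standard deviation reported in the paper
origin = 7        # ASSUMED origin, approximately the mean
for ordinate in (3, 5, 7, 9, 11):
    print(ordinate, round(departure_factor(ordinate, origin, sigma_2), 3))
```

Doubling the departure raises the factor by only about 41 per cent, which is the sense in which the errors of the free-hand curve grow less rapidly than those of a straight line.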


Each of the adjusted errors shown in Table 4 may be further
adjusted by dividing each one by $\sqrt{x/\sigma_x}$, the value shown in the third
column. They may also be adjusted to allow for the difference in the
original standard errors of estimate in the two universes, as noted 
earlier. The standard error in Universe Y was 1.80 and Universe X , 
2.39, so the errors may be made comparable by dividing those from 
each universe by the corresponding standard error of estimate. Per- 
forming these two operations, the average deviations of the errors 
appear as shown in Table 5. These average deviations are now so 
adjusted as to eliminate differences due to (1) number of observa- 
tions in each portion of the distribution, (2) departure from origin, 
and (3) standard error of estimate in the universe. As stated in 
Table 5, the average deviations are in per cent of the deviations that 
would have been estimated from an equation representing the three 
elements discussed. 



















TABLE 5 


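Average deviation of errors at selected ordinates
adjusted for $n_x$, $\sqrt{x/\sigma_x}$ and $S_{1.234}$


Variable | Ordinates | Average for variable
$X_2$ | 3, 5, 9, 11 | 1.00
$X_3$ | 5, 7, 11, 13 | 0.81
$X_4$ | 2, 3, 4, 6, 7, 8, 9 | 0.94

[The entries for the individual ordinates are not legible in this copy.]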





Averaging all values for each variable, as shown in Table 5, there
still remains some difference in the average errors. The errors for
$f(X_3)$ are smaller, on the average, than the errors for either of the
other variables, while those for $f(X_2)$ are larger. This suggests
that some element other than those already considered influences the
errors, and that it differs with individual independent variables.


The formula for the standard error of a linear net regression co-
efficient contains the term


$$\frac{1}{\sqrt{1 - R^2_{2.34}}}$$







which allows for the intercorrelation between the independent variables. 
The more closely an independent variable may be estimated from the 
other independent variables, the less accurately its net regression line 
can be determined. The same relation might be expected to hold true 
of multiple regression curves. We can test this by comparing the aver-
age adjusted errors, just computed, with the intercorrelation, as
follows:





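Regression | Mean adjusted error | $\sqrt{1 - R^2}$ | Mean error $\times \sqrt{1 - R^2}$
$f(X_2)$ | 1.00 | $\sqrt{1 - R^2_{2.34}}$ | 0.76
$f(X_3)$ | 0.81 | $\sqrt{1 - R^2_{3.24}}$ | 0.68
$f(X_4)$ | 0.94 | $\sqrt{1 - R^2_{4.23}}$ | 0.82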


It is evident that the means vary somewhat inversely with the
$\sqrt{1 - R^2}$ values. They may therefore each be multiplied by the cor-
responding $\sqrt{1 - R^2}$ value to secure the final adjusted values, as shown
in the last column.¹ This column now shows the average deviation of
the errors actually observed stated in per cent of an estimated error 
computed from a theoretical equation composed of the four elements 


developed separately. 
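The final column can be approximately reproduced from the three inter-correlations reported for the universe, on the assumption that they pair up as $r_{23} = +.534$, $r_{24} = -.489$ and $r_{34} = -.234$. The paper does not state that the adjustment was computed in exactly this way, so the following is a consistency check rather than the author's calculation; the function name is illustrative.

```python
import math

def r2_on_two(r_ab, r_ac, r_bc):
    """Squared multiple correlation of a on b and c from simple correlations."""
    return (r_ab ** 2 + r_ac ** 2 - 2.0 * r_ab * r_ac * r_bc) / (1.0 - r_bc ** 2)

# ASSUMED pairing of the three inter-correlations given in the text.
r23, r24, r34 = 0.534, -0.489, -0.234

mean_adjusted = {"f(X2)": 1.00, "f(X3)": 0.81, "f(X4)": 0.94}
r2 = {
    "f(X2)": r2_on_two(r23, r24, r34),   # R^2 of X2 on X3 and X4
    "f(X3)": r2_on_two(r23, r34, r24),   # R^2 of X3 on X2 and X4
    "f(X4)": r2_on_two(r24, r34, r23),   # R^2 of X4 on X2 and X3
}
final = {k: mean_adjusted[k] * math.sqrt(1.0 - r2[k]) for k in mean_adjusted}

for k in final:
    print(k, round(math.sqrt(1.0 - r2[k]), 3), round(final[k], 2))
```

Under that pairing the products round to 0.76, 0.68 and 0.82, the values of the last column.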


The average deviations of the observed errors vary from 68 to
82 per cent of the estimated error in each case, as contrasted to the
value of 80 per cent to be expected if the equation gave the standard
error. This is consistent with the fact that the standard error of es- 
timate is included as the initial value in the equation. Furthermore, 
since the samples were drawn from a limited universe, the variation 
observed would tend to be slightly less than if they were drawn from 
an infinite universe with the same characteristics, which is consistent 


1. This demonstration is by no means convincing proof of the need of including 
this adjustment. After this final adjustment, the discrepancy between the 
smallest and largest average errors, 0.68 and 0.82, is still as great as it was 
between the smallest and largest before, 0.81 and 1.00. On logical grounds, 
however, some such adjustment for the closeness of inter-relation between the 
independent variables is necessary, and by analogy, this method seems a pos-
sibility. It may be, however, that the index of (curvilinear) multiple correla-
tion, $\rho_{2.34}$, should be used in the adjustment, rather than the coefficient of
multiple correlation, $R_{2.34}$.


































with the observed values falling mostly a little below the expected
value of 0.80. The elements considered in estimating the error may 
therefore be said to give the standard error of the regression curves. 
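The figure of 80 per cent is the ratio of the average deviation to the standard deviation in a normal distribution, which is the square root of $2/\pi$. The sketch below computes that ratio and, as a further illustration, combines it with the finite-universe factor for samples of 100 from 500; the combination is an assumption about how the two effects would compound, not a figure from the paper.

```python
import math

# Ratio of mean absolute deviation to standard deviation for a normal curve.
average_to_standard = math.sqrt(2.0 / math.pi)
print(round(average_to_standard, 4))

# ASSUMED compounding with the finite-universe factor for samples of 100.
finite_factor = math.sqrt(1.0 - 100 / 500)
print(round(average_to_standard * finite_factor, 3))
```

A value somewhat below 0.80 is therefore what one would look for in the largest samples if the formula gave the true standard error.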








By a combination of induction and deduction, of which the fore- 
going is a condensed re-statement, a tentative formula for the standard 
error of the ordinates of a net regression curve was constructed from 


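the four elements developed separately. They may be combined as
follows:¹


$$\text{I.}\quad \sigma_{f(x)} = \left(S_{1.234}\right) \left(\frac{1}{\sqrt{n_x}}\right) \left(\sqrt{\frac{x}{\sigma_x}}\right) \left(\frac{1}{\sqrt{1 - R^2_{2.34}}}\right)$$


or writing $n_x$ out in full,


$$\text{II.}\quad \sigma_{f(x)} = \left(S_{1.234}\right) \left(\sqrt{\frac{u_x}{n_u \sigma_x}}\right) \left(\sqrt{\frac{x}{\sigma_x}}\right) \left(\frac{1}{\sqrt{1 - R^2_{2.34}}}\right)$$


Hence


$$\text{III.}\quad \sigma^2_{f(x)} = \frac{S^2_{1.234}\, u_x\, x}{\sigma_x^2\, n_u \left(1 - R^2_{2.34}\right)}$$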
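A sketch of the tentative formula as a function may make the combination of the four elements easier to follow. It assumes the formula is the product of the standard error of estimate, the reciprocal square root of the concentration, the square root of the departure in sigma units, and the reciprocal square root of one minus the squared inter-correlation. All argument names are illustrative, and the inputs other than 2.39 and 2.455 are hypothetical.

```python
import math

def ordinate_se(s_est, x_dep, sigma_x, n_u, u_x, r2_inter):
    """Tentative standard error of a net regression curve ordinate (combined form)."""
    variance = (s_est ** 2 * u_x * abs(x_dep)) / (
        sigma_x ** 2 * n_u * (1.0 - r2_inter)
    )
    return math.sqrt(variance)

def ordinate_se_by_parts(s_est, x_dep, sigma_x, n_u, u_x, r2_inter):
    """The same quantity built from the four separate elements."""
    n_x = n_u * sigma_x / u_x
    return (
        s_est
        * (1.0 / math.sqrt(n_x))
        * math.sqrt(abs(x_dep) / sigma_x)
        * (1.0 / math.sqrt(1.0 - r2_inter))
    )

# s_est and sigma_x are values quoted in the paper; n_u and r2_inter are HYPOTHETICAL.
args = dict(s_est=2.39, x_dep=4.0, sigma_x=2.455, n_u=6.0, u_x=1.0, r2_inter=0.4)
print(round(ordinate_se(**args), 3), round(ordinate_se_by_parts(**args), 3))
```

The two functions agree, which confirms that the combined form is simply the product of the separate elements with the concentration written out in full.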


8. TESTING TENTATIVE FORMULA BY SAMPLES DRAWN FROM THE 
ORIGINAL UNIVERSE 





The formula which has just been shown was derived from the 
average errors shown by all the samples, using the known facts about 
each universe—the standard error of estimate, the frequency distribu- 
tions and the standard deviations of each independent variable, and 
the inter-correlations among the independent factors in working out
the estimated errors. But for practical use in estimating the reliability 
of regressions determined from a single sample all that would be known 
about the universe would be what could be inferred from that sample, 





1. Equation (III) may be restated in a simpler form for practical computation, and 
the operations of working out the standard error for selected ordinates along 
the net regression curves may be organized in a systematic manner, as shown 
in the author’s “Methods of Correlation Analysis,” pages 384 to 389. 




and the standard errors of the regression curves would have to be 
computed from the values so obtained. The next step of the experi- 
ment, therefore, was to calculate the standard error separately for 
each sample in turn, using only the values obtained from each one.
These computations were made for each independent variable for each 
abscissa listed in Table C. The actual error of the regression curve 
at that point was then compared to the calculated standard error, and 
the ratio 


$$T = \frac{\text{Observed error}}{\text{Calculated standard error}}$$

computed for each selected abscissa. If the computed error was the 
true standard error of the regression curve, these ratios should then 
be distributed according to the normal curve, and should have a stand- 
ard deviation of 1.00. 


The test was first applied to all the samples from both universes
without including the term $1 - R^2_{2.34}$ in the error formula.


The standard deviation of the ratios, $\sigma_T$, was calculated sep-
arately for each selected abscissa of each independent variable with
results as follows:










TABLE 6 


Standard Deviation of Ratios of Actual Errors to Calculated Errors, 
as shown by 62 separate samples 


Value of independent variable | Errors in $f(X_2)$ | Errors in $f(X_3)$ | Errors in $f(X_4)$


It is evident (1) that $\sigma_T$ does not tend to increase appreciably
as the abscissa departs from the mean of the independent variable; and
(2) that the results based on the errors computed from individual
samples are on the average quite consistent with those based on the
facts from the universe. This is shown more fully in the following
comparison:


Regression | Errors from individual samples | Errors from entire universe; mean adjusted error







Taking 0.80 of the $\sigma_T$ gives an approximate measure of the
average deviations of the $T$ values, to compare with the average devia-
tion of the adjusted errors as calculated in Table 5. The average
deviation of the $T$ values ranges from 93 to 95 per cent of the aver-
age adjusted errors, showing the same average differences from vari-
able to variable as were shown in Table 5 and suggesting the need of
some element in the error formula to allow for the inter-correlation
among the independent variables.


For the next step in the test, the term $1 - R^2_{2.34}$ was included
in the error formula for $f(X_2)$ and the corresponding terms were
included in the other formulas, using, in each case, the $R$ values shown
by each individual sample. Calculating the $T$ values by comparing
the actual errors with these revised estimates, and calculating their
standard deviations, results were secured as follows:


Regression | $\sigma_T$, using full error formula

The $\sigma_T$ is calculated from 0 as origin, disregarding differences
in the average error from zero. It is evident that in these sample re-
sults the errors, on the average, are somewhat less than would be ex-
pected from the formula, as $\sigma_T$ falls below unity. The distribu-
tion of the errors is also important. Figure 4 shows the distribution
of the $T$ values and compares it with the corresponding normal dis-
tribution. The extent of the agreement with the normal distribution
may be judged from the following comparison:







Value of $T$ | Normal distribution


2.00 to 2.99 
1.00 to 1.99 
0.00 to 0.99 
0.00 to -0.99 
-1.00 to -1.99 
-2.00 to -2.99 
-3.00 and larger 


Although the distributions are not exactly normal, they agree fairly
well. The different variables give slightly different distributions, how-
ever. For $f(X_3)$, in particular, the distribution of the errors appears
to be skewed, with more negative errors than positive ones. This may
be due to a slight bias in the free-hand method of fitting the curve, 
which in this instance, for a very peculiarly-shaped regression curve, 
led to a slight but persistent error in the fitted curve. This possible 
individual bias in fitting the curve free-hand will be taken up again 
subsequently. 


The test of the error formula described above was not a complete 
proof of the adequacy of the formula, since it used the same samples 
as those from which the original formula was constructed. For a more 
rigorous test the formula would have to be tried out on completely new 
samples secured from a different universe. Such a test was made in 
the next phase of the investigation. 


9. TESTING TENTATIVE FORMULA BY SAMPLES DRAWN FROM
A NEW UNIVERSE


A new "universe" was constructed for testing purposes, by methods
parallel to those described before. In this case only two independent
variables were used. There were 328 observations in the universe
and 45 samples were selected at random: 15 of 10 observations, 15
of 20 and 15 of 40. (The number of observations was taken as small
as 10 so as to make an extreme test of the value of the sampling
formula.) Multiple curvilinear regressions were determined for $f(X_2)$
and $f(X_3)$, and the standard error of selected ordinates was
computed by equation (III). The value of T was then computed by
dividing the actual errors by the expected. The distribution of these
errors is shown in Figure 5, as contrasted with the normal curve.
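The ratio just described can be illustrated with a minimal sketch. The error values below are invented purely for illustration and are not taken from the study; the function names are hypothetical. The standard deviation is taken from 0 as origin, as was done for the first universe.

```python
import math

def t_ratios(actual_errors, expected_errors):
    """Divide each actual error by the standard error estimated for it."""
    return [a / e for a, e in zip(actual_errors, expected_errors)]

def sd_about_zero(values):
    """Root mean square of the values: the standard deviation taken
    from 0 as origin rather than from the observed mean."""
    return math.sqrt(sum(v * v for v in values) / len(values))

# Hypothetical actual errors (fitted ordinate minus true ordinate) and the
# standard errors estimated for the same ordinates.
actual = [0.6, -1.1, 0.2, 1.5, -0.4]
expected = [0.5, 0.8, 0.4, 0.9, 0.5]

ratios = t_ratios(actual, expected)
sigma_t = sd_about_zero(ratios)
print([round(r, 2) for r in ratios], round(sigma_t, 2))
```

A value of `sigma_t` near 1 would indicate that the estimated standard errors are, on the average, of about the right size; values well above 1 indicate that they understate the actual errors.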


When the standard deviations of T are computed separately for
each size of sample, the results are as follows:


Size of sample          $\sigma_T$


oil 


1.30 
1.80 


Combining the distribution for both $f(X_2)$ and $f(X_3)$, the
distributions of the errors for each size of sample are as follows:


Size of sample | Normal 
Distribution 

Per cent Per cent Per cent Per cent 

of total | of total of total of total 
Over 3.00 2.0 ; 2.9 
2.00 to 2.99 3.3 | . aa 
1.00to 1.99 10.7 | 12.8 
| 31.1 
| 33.3 
8.3 | 7.8 
24 | . 5.5 

3.3 


0.00 to 0.99 34.2 
~1.00 to -1.99 

—2.00 to —2.99 | 
—3.00 nd larger 


0.00to-0.99 | 35.8 
| 
"| 


There were many more wide departures (of 3.00 or larger) than
would be expected if the errors had a normal distribution, with $\sigma$ =
the estimated standard error. Instead of only 5 per cent of the errors
exceeding twice the estimated standard errors, from 11 to 15 per cent
were this large. Yet the general distribution of the errors (Figure 5)
was in fair agreement with a normal distribution.
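The normal expectations used in this comparison can be checked with a short sketch. It only evaluates the two-sided tail areas of the standard normal curve; the function name is hypothetical.

```python
import math

def normal_two_sided_tail(k):
    """P(|Z| > k) for a standard normal variate Z."""
    return math.erfc(k / math.sqrt(2.0))

for k in (2.0, 3.0):
    print(k, round(100.0 * normal_two_sided_tail(k), 2), "per cent")
```

The tail beyond twice the standard error is a little under 5 per cent, and the tail beyond three times the standard error is under 0.3 per cent, which is the scale against which the observed 11 to 15 per cent should be read.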


Two elements may contribute to the greater variation in the actual
errors than in the estimated. With samples of the size involved (10 to
40 cases) the shape of various portions of the curve is determined
by much less than 30 observations, and in some cases, by 10 or less.
With such small samples, Student and Fisher have shown that for
arithmetic means and other constants, the distribution of actual error
÷ estimated error does not follow the normal curve and has a $\sigma$ in
excess of unity. It may be that some modification needs to be
introduced into equation (III) to take account of this tendency before it
can be correctly applied to small samples. From Student's table for
small samples¹, 15 per cent of the errors would be expected to exceed
twice the standard error if there were 3 degrees of freedom in the
sample, and 10 per cent if there were 5. This indicates a reasonable
number of cases, as compared with the size of the samples used in
these tests. But whether [fraction illegible], or some other fraction of
the total number of observations, would give the proper number of cases
to use in entering the table, has not been determined, and more work
needs to be done on this phase of the problem.
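The two percentages quoted from Student's table can be approximated with a minimal sketch. It assumes the modern form of Student's t distribution (Student's own table is arranged differently) and integrates the density numerically; all names are hypothetical.

```python
import math

def t_density(t, df):
    """Density of Student's t with df degrees of freedom."""
    c = math.gamma((df + 1) / 2.0) / (
        math.sqrt(df * math.pi) * math.gamma(df / 2.0))
    return c * (1.0 + t * t / df) ** (-(df + 1) / 2.0)

def t_two_sided_tail(k, df, steps=20000):
    """P(|t| > k), found as 1 minus the Simpson's-rule integral
    of the density over [-k, k]. steps must be even."""
    h = 2.0 * k / steps
    total = t_density(-k, df) + t_density(k, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(-k + i * h, df)
    return 1.0 - total * h / 3.0

for df in (3, 5):
    print(df, round(100.0 * t_two_sided_tail(2.0, df), 1), "per cent")
```

Under these assumptions the tails beyond twice the standard error come out near 14 per cent for 3 degrees of freedom and near 10 per cent for 5, in line with the rounded figures quoted above.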


A second element of error appears to lie in using the coefficient of
correlation, $R_{2.34}$, as one element of the error formula, instead of
using the corresponding index of correlation. Substituting the index of
correlation for the coefficient in the error formula was tried in two of
the samples where the T values were the highest, and in both cases it
much improved the accuracy of the estimated error, reducing values of T
from 5.0 to 3.0, from 8.3 to 4.7, from 6.7 to 3.8, etc. It would appear
that wherever the inter-correlation between the independent factors is
markedly curvilinear, the accuracy of the estimate of the error could be
much improved by measuring that curvilinear inter-correlation, and using
it in computing the standard error of the function.


In view of the two sources of variation mentioned above, the fact
that the variation of the actual errors ranges from 23 per cent to 79
per cent in excess of the variation of the estimated errors does not
necessarily mean that the suggested formula (eq. III) is entirely
inadequate, but may mean only that the necessary reservations in the use
of the formula have not been applied. On the other hand, the fact
that the actual results do vary as widely as this from the expected
suggests that the formula can be used only as a very tentative
approximation to the standard error of the regression curves until its
possibilities and limitations have been more definitely determined.


1. This table is reproduced, in abridged form, in the author's "Methods of
Correlation Analysis," on pages 19 and 392.


10. FREE-HAND VERSUS MATHEMATICAL NET REGRESSION CURVES


It was noted earlier that there appeared to be some tendency toward 
bias in fitting the first set of curves. The errors from the second uni- 
verse, as shown in the last set of results showed a little of the same 
tendency, with the average error not falling exactly at 0. To test 
whether determination of the regression curves mathematically would 
eliminate this bias, mathematical partial regression curves were fitted 
by least squares to one set of samples from the second universe. The 
15 samples of 20 observations were used, and two types of curves 
were fitted—the parabola and the cubic parabola. The regression equa- 
tions were therefore: 


(1)  $X_1 = a + b_2 X_2 + b_2' X_2^2 + b_3 X_3 + b_3' X_3^2$


(2)  $X_1 = a + b_2 X_2 + b_2' X_2^2 + b_2'' X_2^3 + b_3 X_3 + b_3' X_3^2 + b_3'' X_3^3$


The estimated error was calculated for selected ordinates, using
the same equation (III) as derived for free-hand methods, and T
and $\sigma_T$ computed. The values of $\sigma_T$ were as follows:
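A least-squares fit of the parabolic form, with a constant and first and second powers of each of the two independent variables, can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic, the function names are hypothetical, and the normal equations are solved by plain Gaussian elimination rather than by the computing routine the author would have used.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_parabolic(x1, x2, x3):
    """Least-squares constants [a, b2, b2', b3, b3'] of
    X1 = a + b2*X2 + b2'*X2**2 + b3*X3 + b3'*X3**2."""
    rows = [[1.0, u, u * u, v, v * v] for u, v in zip(x2, x3)]
    k = 5
    # Normal equations: (A^T A) coef = A^T y
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
           for i in range(k)]
    aty = [sum(r[i] * y for r, y in zip(rows, x1)) for i in range(k)]
    return solve_linear(ata, aty)

# Synthetic, error-free data generated from known (hypothetical) constants.
true_coef = [2.0, 1.5, -0.2, 0.5, 0.1]
x2 = [float(u) for u in range(1, 6) for _ in range(5)]
x3 = [float(v) for _ in range(5) for v in range(1, 6)]
x1 = [true_coef[0] + true_coef[1] * u + true_coef[2] * u * u
      + true_coef[3] * v + true_coef[4] * v * v for u, v in zip(x2, x3)]

coef = fit_parabolic(x1, x2, x3)
print([round(c, 4) for c in coef])
```

The cubic form differs only in adding the third powers of each independent variable as two further columns.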


             Simple parabola    Cubic parabola


$f(X_2)$          0.77               0.95
$f(X_3)$          0.90               1.13


It would appear, therefore, that equation (III) gives about as
good results in estimating the reliability of net regression curves math- 
ematically determined as it does in estimating the reliability of those 
secured by free-hand fitting. 


Even with the curves fitted by least squares, however, there was
some tendency to bias, as is illustrated in Figures 6 and 7. It is
evident from these figures that neither the free-hand curve nor the
mathematical curve exactly reproduced the true curve, even on the average
of the fifteen samples. The average amount of bias is shown in the
following statement:


AVERAGE BIAS IN FITTING REGRESSION CURVES


Value of        Average error¹ in $f(X_2)$                 Average error¹ in $f(X_3)$
independent     Parabola | Cubic parabola | Free-hand      Parabola | Cubic parabola | Free-hand
variable                                    curve                                      curve


In this particular case, where the true curve is of such a shape as to
be fairly well represented by a parabola or cubic parabola, the
mathematical curves appear to give a slightly more accurate fit, on the
average, than do the free-hand curves. The standard deviation of the
errors, however, is only slightly greater for the free-hand curves than
for those fitted by the equations, as shown by the following
tabulation:²


1. Taken with regard to sign. 

2. At first glance it seems strange that the regressions fitted by the cubic parabola 
should have, on the average, larger errors than those fitted by the simple 
parabola. The explanation may be that the extra constant allowed the cubic 
parabola to follow more closely the individual characteristics of each sample; 
but that in fitting those (partly random) relations more closely, the regressions 
were distorted from the true underlying relation. 





Standard deviation of errors (absolute values).


             Free-hand      Parabolic    Cubic


$f(X_2)$     [illegible]      0.70        0.84
$f(X_3)$        0.91          0.65        0.89


Where the true regression is of such shape that it could not be
represented by any simple equation, it seems likely that the free-hand
method would give a more accurate fit than would a mathematical equation
which was not capable of representing the particular relation involved.
Since, in practical investigations, the shape of the net regression
curve is usually unknown to start with, the most satisfactory procedure
would seem to be to use the free-hand method to determine the approximate
shape of the curves, and then, if their shape appeared to follow any
definite types, to fit curves of those types by least squares as a final
check on the shape of the curves.


CONCLUSION 


This article is only a progress report. The experiments reported 
here suggest that it may be possible to develop a formula for the stand- 
ard error of net regression curves fitted free-hand. The problem has 
not been completely solved; the tentative formula which is developed 
has given only fair results in experimental tests; and several points are 
in need of further study. I hope at some future time to carry this in- 
vestigation further, but my present plans make it necessary to lay it 
aside for a year or more. I am, therefore, publishing this preliminary 
report now, in the hope that others may be led to attack the same 
problem. 


VY or. deco’ Capheca) 








FIGURE A
AVERAGE ERRORS OF REGRESSION CURVES
(Universe X; samples of 30, 50, and 100)


FIGURE 1
CORRELATION BETWEEN CORRESPONDING AVERAGE ERRORS IN UNIVERSES WITH
DIFFERENT STANDARD ERRORS OF ESTIMATE
(average errors of the two universes plotted against each other, with
the theoretical relation)


FIGURE 2
RELATION OF AVERAGE ERRORS, ADJUSTED FOR N, TO DEPARTURE FROM CENTER


FIGURE 3
RELATION OF AVERAGE ERRORS, ADJUSTED FOR N, TO DEPARTURE FROM CENTER


FIGURE 4
FREQUENCY DISTRIBUTIONS OF ERRORS
(per cent of total frequencies against units of T, compared with the
normal curve)


FIGURE 5
FREQUENCY DISTRIBUTIONS OF ERRORS
(by size of sample; per cent of total frequencies against units of T,
compared with the normal curve)


FIGURE 6
AVERAGE CURVES FITTED BY THREE METHODS, $f(X_2)$
(parabola, cubic parabola, and free-hand curve, compared with the true curve)


FIGURE 7
AVERAGE CURVES FITTED BY THREE METHODS, $f(X_3)$
(parabola, cubic parabola, and free-hand curve, compared with the true curve)





TABLE A—SYNTHETIC DATA FOR SAMPLING STUDY 











































No . Xx, 
1 5 3 94 | 104 51 8 4 i 13.7 
2 | 6 2 |142 | 92 52 | 9 3 1148 | 108 
3 | 8 4 | 160 | 110 53 | 7 3 | 19.5 | 10.5 
4 | 7 5 | 168 | 128 54 | 9 |11 | 2 |15.3 | 103 
5 | 10 | 2 |148 | 148 55 | 8 | 9 | 4 1120} 90 
6 | 6 | 6 | 185 | 125 5 | 9 | 7 | 3 1137 | 17 
7 | sii | 4 | 138 | 148 57 | 10 | 8 | 1 |120 | 110 
| 8 | 2 | 4 110 | 99 | 99 58 | 11 {12 | 5 | 201 | 121 
9 | 10 |13 | 2 | 108 | 118 39 | 4/11 | 8 |181 | 121 
1 | 11 }11 | 3 | 188 | 98 6 | 7 \u | 7 |2na | 181 
nu | 4 {10 | 8 | 177 | 157 61 | 9 | 8 | 2 |152 | 122 
12 | 4] 9 | 7 | 183 | 123 2) 8 |i1 | 2 (143 | 83 
13 | 7/913 /]110! 80 63 | 3 | 7 | 7 |124 | 104 
14} 718 | 41167 | 107 64 | 8 | 8 | 7 |170 | 150 
15 |10 |r | 2 | 143 | 103 65 | 3 | 8 | 9 |179 | 119 
1 | 11 | 9 | 5 | 15.3 | 123 66 | 9 |12 | 2 |120 | 110 
17 | 11 | 9 | 6 | 13.4 | 164 67 4/11 | 9 |176 | 166 
18 | 10 | 11 | 2 115.3 | 133 68 | 6 | 7 | 2 |149 | 109 
19 | 10 |13 | 3 | 15.3 | 15.3 69 | 7 |12 | 6 |169 | 169 
20 6|8 | 3]|144 | 94 70 2 | 8 |10 ; 
| 1 | 8 | 12 | 4 | 175 | 145 71 | 5 |10 | 4 {18 
22 | 6 | 7 | 3 | 134 | 104 722 | gs |u| 5 
23 | 10 |11 | 5 | 21a | 121 73 | 9 |10 | 2 
24/11 | 9 | 4 | 180! 11.0 74 111 113 | 5 | 
25 1717134157 | 87 315/818 
26 7 | 9 2/}115 | 85 7% | 5|7|7 
27 6 |;10 | 2 | 126 | 76 77 \ 7 | 8 | 2 
2 | 9/10 | 3) 124) 94 78 | 619 | 2 
2 | 6 | 9 | 5 | 130 | 100 7 | 6/112 | 4 
30 | 12 | 10 | 2 | 159 | 11.9 9 | 31816 
31 6! 8) 3]114 | 74 gs. | 7/713 
| 32 | 1 12 | 4 | 165 | 15.5 82 | 8| 8 | 3 
33 4) 8) 9 | 15 | 135 8 | 6/| 8 | 4 
| 34 | 9 (10 | 6 | 138 | 118 8 |} 8 }12 17 
35 | 7 [10 | 7 | 197 | 137 g5 | 12114 | 3 
36 | 11 | 13 | 1 | 126 | 136 86 | 7/1 | 4 
37 | 6 | 11 | 7 | 208 | 178 87 | 51716 
33 | 6 | 8 | 3! 84] 74 88 | 1 | 13 | 3 
30 | 5 |i | 4 | 162} 92 so | 7 {|i | 2 
40 | 719 | 6 | 164 | 144 9 | 7/11 | 4 
44} 5] 8) 4] 451] 94 o | 7 | 6 
421 617 | 2 Hes 6.9 92 | 7191] 6 
43 | 7 |12 | 6 | 189 | 169 93 | 7 | 5 
44 | 10 | 10 | 2 | 10.9 4 | 6 | 10 | 7 
45 | 418] 4 10.7 9 | 1/13 | 5 
46 | 3] 5 | 5 8.9 9% | 8 | 6 
47 | 11 | 1! 3 9.8 97 | 6 6 
48 | 7\11| 7 18.1 98 | 2 6 
49 | 10 | 13 | 4] 173! 123 9 | 7 4 
50 | 5 | 7 | 4 8.1 100 | 10 6 





TABLE A—SYNTHETIC DATA FOR SAMPLING STUDY (Continued)


18.4 | 13.4 
20.7 | 13.7 
15.3 | 11.3 
15.9 | ~ 

| 14.8 

/ 15.1 | 

| 13.1 | 
14.0 | 

| 18.4 | 
15.4 

| 14.1 | 

| a7 | 
18.8 | 

| 15.7 | 
17.8 

| 15.7 | 

114.1 | 

| 20.5 

| 15.7 

| 13.9 

13.8 

| 13.1 
91 

| 20.1 | 

| 14.8 | 

| Ze } 

| 18.4 

| 13.8 | 

| 15.8 | 

| 14.5 | 

| 14.8 | 

| 17.8 | 

| 21.9 | 

| 18.1 

21.2 | 

10.8 ; 

19.1 | 

20.8 

20.6 





TABLE B—COEFFICIENTS AND INDEXES OF MULTIPLE CORRELATION FOUND AT EACH
SUCCESSIVE APPROXIMATION

(uncorrected for number of variables)


Sample | Universe X: 1st, 2nd, 3rd curves | Universe Y: 2nd, 3rd, 4th curves


Samples of 30 


698 : | 715 
781 ; 794 
800 | 858 
782 | gol 
R12 | 851 
745 | 746 
877 998 
736 . 737 
679 720 
767 782 
639 676 
660 | 669 
762 785 
880 | 981 
621 619 | 642 
756 803 | 819 


Samples of 50
786 | 803 793 
686 731 713 
730 736 738 
786 197 796 
773 .804 
764 759 723 
772 798 79 
749 Jl 477 
676 | 691 695 699 
672 | 731 =| 733 736 


Samples of 100 


769 
656 
700 
799 





TABLE C—FOR UNIVERSE X: SAMPLING ERRORS IN NET REGRESSION CURVES, FOR
SELECTED ORDINATES

(The errors are observed ordinates minus true ordinates)




TRANSFORMATIONS OF BIMODAL DISTRIBUTIONS 





I. INTRODUCTION 





Several men have concerned themselves extensively with the
transformation of frequency distributions, for instance, Edgeworth,
Kapteyn, Arne Fisher, and H. L. Rietz (see 1, bibliography). The first
three of these men have been concerned with transformations as a
means of extending the scope of the normal distribution and
Gram-Charlier system as a method of description. Rietz has been more
interested in the properties of the transformed distributions.


There are three types of transformations that are of particular 
importance : 












(1)  $u = x^n$, because it has a physical interpretation.
(2)  $u = \log x$, because Arne Fisher and others find it useful.
(3)  $u = e^x$, because it is the inverse of (2).








These three transformations will be discussed in some detail for 
bimodal frequency distributions. It is interesting to note that it is 
possible to transform a bimodal distribution into a unimodal distribu- 
tion and vice versa by means of these transformations. The general 
scheme of the first part of the following is that of H. L. Rietz (see
1, bibliography).


The latter part of this paper consists of a few remarks on trans- 
formations in general. 





G. A. BAKER 


II. THE TRANSFORMATION $u = x^n$


In the following theorems it will be understood that one means
at least one and that a frequency function is to have a total area of
unity.


The transformation $u = x^n$ has a very clear physical interpretation,
for if the diameters of oranges are distributed as $f(x)$ then the
distribution of the volumes of these oranges would be obtained by
making the transformation $u = kx^3$.
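The change of variable behind this interpretation can be sketched numerically. The sketch assumes a purely hypothetical triangular distribution of diameters and takes the constant as pi/6 (the volume of a sphere of given diameter); neither assumption comes from the paper. It forms the density of the volumes and confirms that the total area is still unity.

```python
import math

def transformed_density(f, k, n):
    """Density of u = k * x**n for positive x with density f."""
    def phi(u):
        x = (u / k) ** (1.0 / n)
        # |dx/du| = x / (|n| * u) when u = k * x**n
        return f(x) * x / (abs(n) * u)
    return phi

def integrate(g, lo, hi, steps=20000):
    """Midpoint rule."""
    h = (hi - lo) / steps
    return h * sum(g(lo + (i + 0.5) * h) for i in range(steps))

def diameter_density(x):
    """Hypothetical density of diameters: unit-area triangle on [2, 4]."""
    return max(0.0, 1.0 - abs(x - 3.0)) if 2.0 <= x <= 4.0 else 0.0

k = math.pi / 6.0  # assumed constant: sphere volume = (pi/6) * diameter**3
volume_density = transformed_density(diameter_density, k, 3)
area = integrate(volume_density, k * 2.0 ** 3, k * 4.0 ** 3)
print(round(area, 4))
```

The factor `x / (|n| * u)` is the Jacobian of the transformation; it is what shifts the relative heights of the modes in the theorems of this section.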


Theorem I.


Given a continuous bimodal frequency function of positive variates
$y = f(x)$ with a range $0 < a \le x \le e$, with modes at $x = b$,
$x = d$ and antimode at $x = c$ $(a < b < c < d < e)$, $f(a) = f(e) = 0$,
and with a continuous derivative, then the frequency distribution
$v = \phi(u)$, $[\phi(u) = \frac{1}{n} u^{\frac{1}{n}-1} f(u^{1/n})]$, of
positive variates $u = x^n$ has modes as follows:


Case I. $n > 1$


(1) one mode $a^n < u \le b^n$ always, and (2) one mode and
one antimode $c^n < u \le d^n$ if $|(1-n) f(u^{1/n})| < u^{1/n} f'(u^{1/n})$
somewhere in this interval.


Case II. $0 < n < 1$


(1) always one mode $d^n \le u < e^n$, (2) a mode and antimode
$b^n \le u < c^n$ if $|u^{1/n} f'(u^{1/n})| > (1-n) f(u^{1/n})$
somewhere in this interval.

Case III. $n < 0$


(1) one mode $e^n \le u < d^n$, (2) one mode and one antimode
$c^n < u < b^n$ if $|u^{1/n} f'(u^{1/n})| > (1-n) f(u^{1/n})$
somewhere in this interval.


Proof:


Since $u^{1/n}$ is taken to be positive, then if $\frac{dv}{du}$ is to be
zero we must have


(1)  $(1-n) f(u^{1/n}) + u^{1/n} f'(u^{1/n}) = 0$









Also we have by hypothesis


(2)  $f(a) = f(e) = 0$
(3)  $f'(b) = f'(c) = f'(d) = 0$


and that $f'(x)$ is continuous.


From these considerations the proof of the theorem follows quite 
simply ; for instance: 





Case I. $n > 1$


In the interval $e^n > u \ge d^n$ (1) is negative. At $u = c^n$
(1) is negative. In the interval $c^n < u < d^n$, $f'(u^{1/n})$ is
positive and hence from continuity there is a maximum and a minimum
if $u^{1/n} f'(u^{1/n}) > |(1-n) f(u^{1/n})|$ for some $u$ in this
interval, and neither if the reverse inequality holds for every $u$ in
this interval. At $a^n$ (1) is $u^{1/n} f'(a)$, which is zero or
positive, while at $b^n$ (1) is negative. If $f'(a)$ is positive
there is clearly a maximum at the point where the sign of the continuous
derivative changes from positive to negative in the interval
$a^n < u \le b^n$. If $f'(a)$ is zero it follows that there is also a
maximum, since $v = 0$ at $u = a^n$ and then increases before decreasing
at $u = b^n$.














The other cases follow from exactly similar reasoning.


Theorem II.


In case the bimodal continuous frequency function $y = f(x)$ (of
Theorem I) is symmetrical about the antimodal line $x = c$, then the
mean value of $u$ in the frequency distribution $v = \phi(u)$ of
$u = x^n$ ($n \ne 0$ nor 1) is less or greater than its median value
according as the value of $n$ lies between 0 and 1 or outside of these
bounds.


The first moment of the transformed distribution is given by
$\bar{u} = \int \frac{1}{n} u^{1/n} f(u^{1/n})\,du = \int x^n f(x)\,dx$,
i.e., we have $\bar{u} = \mu'_n$, where $\mu'_n$ is the $n$th moment about
the origin of the original frequency distribution $y = f(x)$. Denoting
the mean value of $x$ by $\bar{x}$, it is known¹ for every set of
positive values that $\mu'_n < \bar{x}^n$ when $n$ lies between 0 and 1,
and that $\mu'_n > \bar{x}^n$ when $n$ lies outside this interval.


Since $\bar{x} = c$ when $y = f(x)$ is symmetrical about this line, and
the median of $u$ is therefore $c^n$, the theorem follows. This follows
Rietz exactly.


Theorem III.


In case the continuous bimodal frequency function $y = f(x)$ (of
Theorem I) is symmetrical about the antimodal line $x = c$, the
frequency distribution $v = \phi(u)$ of $u = x^n$ ($n \ne 0$ nor 1) has
the following relations between its modes and its median.


Case I. $n > 1$


One mode < median, in any case, and one greater if
$|(1-n) f(u^{1/n})| < u^{1/n} f'(u^{1/n})$ somewhere in the interval
$c^n < u \le d^n$.

Case II. $0 < n < 1$

One mode > median, in any case, and one less if
$|u^{1/n} f'(u^{1/n})| > (1-n) f(u^{1/n})$
at some point $b^n \le u < c^n$.

Case III. $n < 0$

One mode < median, in any case, and one greater if
$|u^{1/n} f'(u^{1/n})| > (1-n) f(u^{1/n})$
at some point $c^n < u < b^n$.

As an example of a transformation $u = x^n$ which transforms
a bimodal frequency function satisfying the conditions of the previous
theorems into a unimodal distribution consider the following.


Take $n = 37$ and $f(x) = -x^4 + 12x^3 - 50x^2 + 84x - 44$,
$0 < a \le x \le e < 6$, $f(a) = f(e) = 0$; then

(1)  $(1-n) f(u^{1/n}) + u^{1/n} f'(u^{1/n}) = 0$


1. See J. L. W. V. Jensen, Acta Mathematica, Vol. 30 (1906), pp. 180-187.





Instead of the variable $u^{1/n}$ we may just as well write an $x$.
Hence (1) becomes


(2)  $F(x) = 32x^4 - 396x^3 + 1700x^2 - 2940x + 1584$


Calculating Sturm's functions for (2), it is easily seen that the
transformed distribution has only one mode and that in the interval
$0 < u < 1$.
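The conclusion drawn from Sturm's functions can be cross-checked by a direct numerical scan. The sketch below reads the example as $f(x) = -x^4 + 12x^3 - 50x^2 + 84x - 44$ with $n = 37$ (a reading of a badly printed passage, so an assumption), evaluates the left side of condition (1) in terms of $x$, and looks for its changes of sign between the two zeros of $f$. Function names are hypothetical, and a scan is swapped in for the Sturm sequence the author used.

```python
import math

def f(x):
    return -x**4 + 12 * x**3 - 50 * x**2 + 84 * x - 44

def f_prime(x):
    return -4 * x**3 + 36 * x**2 - 100 * x + 84

def mode_condition(x, n):
    """Left side of condition (1), written in x = u**(1/n)."""
    return (1 - n) * f(x) + x * f_prime(x)

def bisect(g, lo, hi, tol=1e-12):
    """Refine a bracketed sign change of g by bisection."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if (g(lo) > 0) == (g(mid) > 0):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def sign_changes(g, lo, hi, steps=20000):
    """List (x, kind) for every change of sign of g on (lo, hi).
    For positive n, a change from + to - marks a mode."""
    found = []
    h = (hi - lo) / steps
    prev_x, prev = lo, g(lo)
    for i in range(1, steps + 1):
        x = lo + i * h
        cur = g(x)
        if prev * cur < 0:
            kind = "mode" if prev > 0 else "antimode"
            found.append((bisect(g, prev_x, x), kind))
        prev_x, prev = x, cur
    return found

# f = -(x-3)**4 + 4*(x-3)**2 + 1 vanishes at x = 3 -/+ sqrt(2 + sqrt(5)).
half_range = math.sqrt(2.0 + math.sqrt(5.0))
a_end, e_end = 3.0 - half_range, 3.0 + half_range

found = sign_changes(lambda x: mode_condition(x, 37), a_end + 1e-9, e_end - 1e-9)
for x, kind in found:
    print(kind, "at x =", round(x, 4), "u =", round(x ** 37, 4))
```

Under this reading the scan reports a single change of sign, a mode with $x$ slightly below 1, so that $u = x^{37}$ lies between 0 and 1.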


III. TRANSFORMATION $u = \log x$
Theorem IV.


Given a continuous bimodal frequency function (of Theorem I)
with a range $1 < a \le x \le e$, then the frequency distribution
$v = \phi(u)$, $[\phi(u) = e^u f(e^u)]$, of positive variates
$u = \log x$ has one mode, in any case, $\log d \le u < \log e$ and has
a mode and antimode in the interval $\log b \le u < \log c$ if
$|e^u f'(e^u)| > f(e^u)$ somewhere in this interval.


This follows very simply from considering $\frac{dv}{du}$, which is


$e^{2u} f'(e^u) + e^u f(e^u)$


Theorems similar to those stated under the transformation $u = x^n$
concerning the relative position of the modes and median of the
transformed distribution may be stated here.


As an example of a bimodal frequency distribution satisfying our
hypothesis and which is transformed into a unimodal distribution by
the transformation $u = \log x$ consider

$f(x) = -x^4 + 16x^3 - 92x^2 + 224x - 148$, $1 < a \le x \le e < 7$


The condition for the vanishing of the derivative of the transformed
distribution takes the form


$F(x) = -5x^4 + 64x^3 - 276x^2 + 448x - 148$


By calculating Sturm's functions for $F(x)$ it is easily seen that
the transformed distribution has but one mode and that in the interval
$\log 5 < u < \log 6$.


IV. TRANSFORMATION $u = e^x$
Theorem V.


Given a continuous bimodal frequency function (of Theorem I),
then the frequency distribution $v = \phi(u)$,
$[\phi(u) = \frac{1}{u} f(\log u)]$, of positive variates $u = e^x$ has
one mode $e^a < u \le e^b$ and has a mode and antimode in the interval
$e^c < u \le e^d$ if $f'(\log u) > f(\log u)$ at some point in this
interval.


For $\frac{dv}{du} = \frac{1}{u^2}\,[f'(\log u) - f(\log u)]$;


from this the theorem follows.


Theorems similar to those stated concerning the relative positions
of the median and modes of the transformed distribution in the case
of the transformation $u = x^n$ may be stated here also.


As an example of a bimodal frequency distribution that satisfies
our hypothesis and is transformed into a unimodal distribution by the
transformation $u = e^x$ consider


$f(x) = -x^4 + 16x^3 - 92x^2 + 224x - 148$, $1 < a \le x \le e < 7$


The condition that the derivative of the transformed distribution
vanish takes the form


$F(x) = x^4 - 20x^3 + 140x^2 - 408x + 372$

By calculating Sturm's functions for $F(x)$ it is easily seen
that the transformed distribution has only one mode and that in the
interval $e < u < e^2$.


V. TRANSFORMATIONS IN GENERAL

Suppose that we have a frequency distribution the distribution of
whose parameters due to random sampling we know. If we transform
this distribution what will happen to the distributions of the
estimates of the parameters? It appears that, in view of the fact
that bimodal and possibly multimodal distributions may be transformed
by fairly simple transformations into unimodal distributions, there will
be no simple relation between the change in the frequency distribution
and the corresponding changes in the distributions of the estimates of
the parameters by means of random samples. As a specific example
of these general remarks consider the following.


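If the normal curve


(1)  $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{x^2}{2\sigma^2}}$


is transformed by the transformation


(2)  $u = x^2$


giving


(3)  $\psi(u) = \frac{1}{\sigma\sqrt{2\pi u}}\, e^{-\frac{u}{2\sigma^2}}$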





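Then, applying a general method for finding the distribution of
the means of samples, first developed by J. O. Irwin (2, bibliography),
the mean values of the $u$'s are found to be distributed as proportional to


(4)  $\bar{u}^{\frac{n-2}{2}}\, e^{-\frac{n\bar{u}}{2\sigma^2}}$


the $m$th moment of which about the origin is


(5)  $\mu'_m = \frac{(n+2)\cdots(n+2m-2)}{n^{m-1}}\,\sigma^{2m}$, so that $\beta_1 = \frac{8}{n}$, $\beta_2 - 3 = \frac{12}{n}$
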


Thus we see that, although the sampled population is J-shaped, 
the distribution of the estimates of the means ultimately approaches 
the normal distribution but that this approach is rather slow. 
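How slowly the normal form is approached can be seen from the moment constants. The sketch below assumes the standard result that the mean of $n$ squared normal deviates has raw moments $n(n+2)\cdots(n+2m-2)/n^m$ in units of $\sigma^{2m}$, and derives $\beta_1$ and $\beta_2$ from them in exact fractions. Function names are hypothetical.

```python
from fractions import Fraction

def raw_moment(m, n):
    """m-th moment about the origin of the mean of n squared normal
    deviates, in units of sigma**(2*m): n(n+2)...(n+2m-2) / n**m."""
    prod = Fraction(1)
    for j in range(m):
        prod *= n + 2 * j
    return prod / Fraction(n) ** m

def betas(n):
    """Pearson's beta_1 and beta_2 of the distribution of the mean."""
    m1, m2, m3, m4 = (raw_moment(m, n) for m in range(1, 5))
    mu2 = m2 - m1**2
    mu3 = m3 - 3 * m1 * m2 + 2 * m1**3
    mu4 = m4 - 4 * m1 * m3 + 6 * m1**2 * m2 - 3 * m1**4
    return mu3**2 / mu2**3, mu4 / mu2**2

for n in (10, 40, 100):
    b1, b2 = betas(n)
    print(n, float(b1), float(b2))
```

Both $\beta_1$ and $\beta_2 - 3$ fall off only as $1/n$, so a sample of 100 still leaves $\beta_1 = 0.08$ against the normal value of 0.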


It has been shown (see 3) that (4) is also the distribution of the
estimates of the second moment by means of samples of $n$ drawn from
(1), the second moment of the sample being taken about the mean
of (1). This is a special example of a general consideration that is of
considerable interest in this connection.


It has been shown (see 3) that, formally, the distribution of the
estimates by means of samples of $n$ of the $m$th moment of a population
represented by $f(x)$, $a \le x \le b$, the $m$th moment of the
sample being taken about the mean of $f(x)$, is given by the solution
of the integral equation


(6)  $F(z) = \int_{\alpha}^{\beta} \psi(x)\, e^{zx}\, dx$


where $\psi(x)$ is the unknown distribution of $n$ times the estimates
of the $m$th moment of the population about the mean of the population
and

$F(z) = \left( \int_a^b f(x)\, e^{z x^m}\, dx \right)^n$

and if $m$ is even

$\alpha = 0$
$\beta$ = larger of $na^m$, $nb^m$

if $m$ is odd

$\alpha = na^m$
$\beta = nb^m$

Now, the formal development for finding the distribution of the
means of samples of $n$ drawn from a population represented by $f(x)$
transformed by the transformation $u = x^m$ leads to a relation
equivalent to (6) (see 2). This result may be stated as


Theorem VI.¹

If the distribution of the estimates of $n$ times the $m$th moment
of a population represented by $f(x)$, $a \le x \le b$, about the mean
of $f(x)$ exists as a solution of (6) it is identical with the
distribution of the estimates by means of samples of $n$ of $n$ times the
mean, measured from the mean of $f(x)$, of the population represented by
$f(x)$ transformed by the transformation $u = x^m$.


1. This theorem permits of an obvious generalization to the case of the $k$th moment
of the transformed distribution.


This enables us to formally identify these two problems so that 
anything that is true of one distribution is also true of the other. 





With other transformations the relation between the distribution
of the means of random samples from the transformed distribution
and the distribution of the estimates of the parameters of the original
distribution becomes much more complicated.


Further, we might say a few words with regard to the possibility 
of transforming various types of distributions into various other types. 





Suppose that $f(x)$ is a continuous frequency function of positive
variates, $a \le x \le b$, and that $f'(x)$ is continuous in this
closed interval. Now make the transformation


(7)  $u = \phi(x)$


and suppose that $\phi(x)$ is such that (7) can be solved explicitly for
$x$, i.e.,


(8)  $x = \psi(u)$


Then $f(x)\,dx$ becomes, assuming $\psi'(u)$ is continuous,
$f[\psi(u)]\,\psi'(u)\,du$, so that


(9)  $U(u) = f[\psi(u)]\,\psi'(u)$


Supposing that $f$ is known, what can we do towards fixing the
form of $U$ by a suitable choice of $\psi$?


Now, the simplest of all possible frequency distributions, from the
standpoint of description by means of a continuous function, is one in
which the probabilities of all values of the variate are equal. Hence
we will suppose for illustration that


(10)  $U(u) = f[\psi(u)]\,\psi'(u) = k$


whence, putting $\psi(u) = y$, we have

(11)  $\int f(y)\,dy = ku + c$


Suppose that


(12)  $f(x) = \alpha x + \beta$


Then


(13)  $y = \frac{-\beta \pm \sqrt{\beta^2 + 2\alpha(ku + c)}}{\alpha}$


From this it is apparent that if $f(x)$ is any polynomial whose
degree is less than four and which is positive for $a \le x \le b$, it
may, conceivably, be transformed into a rectangular distribution.
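The linear case can be verified numerically. The sketch assumes a hypothetical density $f(x) = 0.2x + 0.1$ on $1 \le x \le 3$ (which has unit area), takes $k = 1$ so that $u$ runs from 0 to 1, fixes $c$ so that $u = 0$ corresponds to $x = a$, and uses the positive root of the quadratic. None of these particular values comes from the paper.

```python
import math

ALPHA, BETA = 0.2, 0.1   # hypothetical f(x) = ALPHA * x + BETA
A, B = 1.0, 3.0          # hypothetical range; f has unit area on [A, B]
K = 1.0                  # height of the rectangular distribution
C = 0.5 * ALPHA * A * A + BETA * A   # makes u = 0 correspond to x = A

def f(x):
    return ALPHA * x + BETA

def psi(u):
    """x as a function of u: positive root of
    (ALPHA / 2) * y**2 + BETA * y = K * u + C."""
    return (-BETA + math.sqrt(BETA * BETA + 2.0 * ALPHA * (K * u + C))) / ALPHA

def transformed_density(u, h=1e-6):
    """U(u) = f[psi(u)] * psi'(u), with psi' by central difference."""
    dpsi = (psi(u + h) - psi(u - h)) / (2.0 * h)
    return f(psi(u)) * dpsi

print(psi(0.0), psi(1.0))
print([round(transformed_density(u), 6) for u in (0.1, 0.5, 0.9)])
```

The transformed density stays at the constant `K` across the range, which is the rectangular form sought. For polynomials of second or third degree the same step requires solving a cubic or quartic for $y$.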


If in place of $k$ we were to put a specified function, say the
normal function, we would run into considerable difficulty.


In (9) we may regard $\psi$ as known and then ask what forms of
$f$ may be transformed into certain specified forms. For instance,
let us take

$u = \log x$
$x = e^u$


$U(u) = f(e^u)\, e^u$


(14)  $U'(u) = e^u\,[\,f(e^u) + e^u f'(e^u)\,]$








Now, since $u > 0$, it is apparent that if $f(x) = c$, (14)
has no zero.


Let us put, for illustration, $U'(u) = 0$ or $U(u) = k$.
Then $f(x) = \frac{k}{x}$.

However, if we were to suppose that (14) vanished at only one
point, at exactly two points, etc., instead of identically it would be very
difficult to express this in terms of the form of $f$.

VI. SUMMARY 


It has been shown that unimodal distributions may be transformed 
into bimodal distributions by means of rather simple transformations. 


This suggests that bimodal distributions are not necessarily the result 
of heterogeneity. 


The fact that a badly misshapen distribution may be transformed
into something that is approximately normal does not seem to be of
much aid in determining the distribution of the estimates of the
constants of the original distribution.


The problem of transforming a specified distribution into another
specified distribution is very difficult in general but could, perhaps, be
handled to an adequate degree of approximation in special cases.


BIBLIOGRAPHY


1. Rietz, H. L. On Certain Properties of Frequency Distributions of the
Powers and Roots of the Variates of a Given Distribution. Proceedings of the
National Academy of Sciences, vol. 13, No. 12 (Dec. 1927), pp. 817-820.


2. Irwin, J. O., M.A., M.S. On the Frequency Distribution of the Means of
Samples from a Population Having Any Law of Frequency with Finite
Moments, with Special Reference to Pearson's Type II. Biometrika, vol. 19
(1927), pp. 225.


3. Baker, G. A. Random Sampling from Non-Homogeneous Populations.
Metron, vol. VIII, Part 3 (Feb. 28, 1930), page 1.








ERROR AND UNRELIABILITY IN SEASONALS 


By 


EDGAR Z. PALMER


An article in the Annals of Mathematical Statistics for February,
1930, entitled “A Mathematical Theory of Seasonals,” by the Statistical
Department of the Detroit Edison Company, has three objects.
It presents a mathematical version of the time series analysis, sug- 
gests the “interpolation” method of computing seasonals, and constructs 
a theoretical time series as a test of the new method. The mathemat- 
ical analysis and the theoretical series are based upon the assumption 
that the trend, cycle, and seasonal are proportional to each other, while 
the “errors” or residuals are additive in nature. The reasoning is 
not necessarily valid for series where the cycle or the seasonal is addi- 
tive rather than proportional to the trend. 


The interpolation method as proposed consists in (1) finding the 
total of the items for each of the twelve months, and (2) dividing each
total by a function which theoretically contains the trend and the cycle 
insofar as they influence the particular month. In practice this twelve- 
month function turns out to be a smooth trend curve, and the method 
of its calculation inspires little confidence that it can reflect much 
cyclical influence. The function for each month is simply a weighted 
sum of the annual totals, the weights varying for different months. 
The early years are weighted more heavily in finding the values of the 
function which apply to the first half of the year, while the later years 
are given a greater weight in the second half of the year. The func- 
tion is influenced almost solely by trend, or rather, by the difference 
between the first year and the last year of the data, since these two 
years are the only ones whose weights vary considerably from month 
to month. It is certain that no cyclical movement, however violent, 
can have the proper effect upon this function unless it affects the 
two extreme years. 







Since the interpolation method involves dividing the monthly totals
(for which may be substituted the monthly means) by a function which
is mainly composed of trend, we are justified in considering it a
variation of the well-known monthly-means method.¹ In the theoretical
series of the Detroit Edison article, the means of each month,
corrected for trend by the Davies method, yield a seasonal index almost
identical with that obtained by the interpolation method (see Table I).
It should be noted that we used the very easily computed semi-means
line to correct the monthly means for trend. The semi-means trend
in this series is not as steep as the theoretical trend used in the
construction of the series. For the purposes of this quick and easy
seasonals method, however, the semi-means trend is accurate enough.


TABLE I


SEASONALS OF THE THEORETICAL SERIES


As Obtained by Five Methods


* From the Detroit Edison article.


1. Davies, Economic Statistics (1922), p. 117.


Any one series used in the comparison of methods should, of
course, be viewed merely as an illustration, or at most as a sample, of
their results. A thousand such series, constructed upon a thousand
variations in assumptions, are necessary for determinative comparisons.
Long and short series, large and small seasonals, cycles, trends, and
irregulars, regular and irregular seasonals, curved and straight trends,
additive and proportional combination of the factors: each of these
attributes introduces some elements of error into the computation of
seasonals, and the errors are not necessarily constant as between
different methods.


An instance of the danger of using any single theoretical series
occurs in the Detroit Edison article. If we test the residual factor, we
find that it also contains some seasonal variation. The true seasonal
index of the series, then, is the theoretical seasonal modified by
whatever seasonal is to be found in the residuals. There is, as might be
expected, some seasonal inequality in the cyclical factor as well, but
since it is the task of the method used to eliminate this cyclical
influence, we do not consider it a part of the true seasonal. No method,
however, can be expected to distinguish between a seasonal arbitrarily
designated as the theoretical, and one which is added as part of the
residual factor. In Table I, the first column gives the theoretical, and
the second column the true seasonal, of the series.


The authors compare their results with those by the link-relative
method, and find that the interpolation method gives an index slightly
closer to the theoretical seasonal. For further test, we have computed
the seasonal by the somewhat more logical ratio-to-trend-cycle method.
We used the twelve-month moving mean, centered on the sixth month,
to represent the combined trend and cyclical factors. Then we found
the ratios of each monthly item in the series to the corresponding
moving mean figure, and, after arraying the ratios for each of the twelve
months, we found the modified median of each array. This is
approximately the method suggested by the Federal Reserve Board,¹ and gives
a seasonal with very much less error than the previously mentioned
methods. (See Table II)


1. Joy and Thomas, The Use of Moving Averages in the Measurement of
Seasonal Variations, Journal of the American Statistical Association, vol. 23,
p. 241.
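
The ratio-to-trend-cycle computation described above can be sketched in Python. This is only an illustration under stated assumptions, not the Federal Reserve Board's procedure: the function names are hypothetical, the "modified median" is assumed here to mean the average of the few central items of each monthly array, and the final scaling of the twelve indices to average 1 is likewise an assumption. The test series is invented.

```python
from statistics import mean


def moving_mean_12(series):
    """Twelve-month moving means, each placed at the sixth month of its window.

    Returns a dict mapping the index of that sixth month to the moving mean.
    """
    return {k + 5: mean(series[k:k + 12]) for k in range(len(series) - 11)}


def modified_median(values, keep=3):
    """Average of the `keep` central items of the sorted array (assumed definition)."""
    ordered = sorted(values)
    keep = min(keep, len(ordered))
    start = (len(ordered) - keep) // 2
    return mean(ordered[start:start + keep])


def ratio_to_moving_mean_seasonal(series):
    """Seasonal index for a monthly series whose first item is a January."""
    arrays = [[] for _ in range(12)]
    for idx, base in moving_mean_12(series).items():
        arrays[idx % 12].append(series[idx] / base)  # ratio of item to moving mean
    raw = [modified_median(a) for a in arrays]
    scale = 12 / sum(raw)                            # make the indices average 1
    return [r * scale for r in raw]


# Hypothetical check: six years of a level series carrying a fixed seasonal.
pattern = [0.95, 0.90, 1.00, 1.05, 1.10, 1.00, 0.95, 1.00, 1.05, 1.10, 0.95, 0.95]
series = [200 * s for _ in range(6) for s in pattern]
index = ratio_to_moving_mean_seasonal(series)
```

With no trend, cycle, or residual in the invented series, the moving mean is flat and the index returns the pattern that was put in; a real series would show the corner-cutting discussed in the text.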





The ratios to the moving mean may be expected to be free from
both trend and cyclical influence. However, the moving mean has its
faults, especially in its tendency to cut corners when the true cycle
makes a sharp change of direction, and in its failure to extend to
the ends of the series. For this reason we made another computation,
using a corrected moving mean. The data and the moving mean were
graphed together, and a free-hand curve was drawn (without reference
to the theoretical trend-cycle curve given in the article), correcting
the moving mean in three places and extending it to the limits of
the series. The seasonal index computed from the ratios to the new
trend-cycle curve had an error about half that of the interpolation
method. (See Table II)


TABLE II


ERROR OF SEASONALS


Theoretical seasonal
True seasonal

Monthly means
Interpolation                    .0269*
Link relative                    .0277*
Ratio to trend-cycle:
    Moving mean                  .0208
    With free-hand correction    .0164


* From the Detroit Edison article.







When W. I. King¹ first proposed the use of a free-hand curve in
this connection, objection was raised that this introduced the personal
equation into what should be mechanically determinable. In many
series, however, any experienced statistician would draw a curve which
would fit the data better than the moving mean. There is not as much
discretion involved in drawing the curve as there is in choosing
between two mechanical methods. The possible error due to the
personal factor is very much less than the error made certain by the use
of any more mechanical method.


1. King, An Improved Method for Measuring the Seasonal Factor, Journal of
the American Statistical Association, vol. 19, p. 301.


An important test of the reliability of the methods of finding
seasonals, which may be applied to actual series where the true seasonal
is unknown, consists in an examination of the monthly arrays. The
monthly-means and the interpolation methods depend for their
reliability upon the distribution of the arrays of the original data from
the means of each month. Similarly, the link-relative method depends
upon the scatter of the link relatives about their medians, and the
ratio-to-trend-cycle methods upon the arrays of ratios. We measured the
dispersion for each month for any method by the mean deviation of
its array about its central tendency. This should be divided by the
central tendency itself to obtain a relative dispersion measure for that
month. Then the mean of all twelve dispersion measures was taken,
to give an indication of the unreliability of the method as a whole.
The great unreliability of any method based on the monthly means
is apparent from Table III, as well as the superiority of the
ratio-to-trend-cycle method with free-hand correction.


TABLE III


UNRELIABILITY OF SEASONALS


                                 Relative mean deviation
Method                           of monthly arrays

Monthly means                    .1838
Interpolation                    .1838
Link relative                    .0734
Ratio to trend-cycle:
    Moving mean                  .0621
    With free-hand correction    .0492
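
The unreliability measure described above can be written out as a short Python sketch. The function names and the sample arrays are hypothetical; the choice of central tendency (mean or median) is left as a parameter because the text uses the mean for some methods and the median for others.

```python
from statistics import mean, median


def relative_mean_deviation(array, center=mean):
    """Mean absolute deviation of one monthly array about its central
    tendency, divided by that central tendency."""
    c = center(array)
    return mean([abs(v - c) for v in array]) / c


def unreliability(monthly_arrays, center=mean):
    """Mean of the twelve relative dispersion measures."""
    return mean([relative_mean_deviation(a, center) for a in monthly_arrays])


# Invented arrays: every month has the same three observations.
arrays = [[90.0, 100.0, 110.0] for _ in range(12)]
by_mean = unreliability(arrays)            # deviations 10, 0, 10 about 100
by_median = unreliability(arrays, median)  # same here, since mean == median
```

For the invented arrays each month's mean deviation is 20/3 about a center of 100, so the measure is 20/300 for either choice of center.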




Some question may arise concerning the propriety of submitting 
the link relatives to this test of reliability, because the manipulations 
to which the medians are subjected before they emerge as a seasonal 
index may decrease the error inherent in the spread of the monthly 
arrays. We have not been able to derive the algebraic relationship 
between the mean deviation of the link relatives and the corresponding 
unreliability of the final seasonal indexes based upon them. In erratic 
series, the process of computing link relatives tends to heighten the 
spreading effects of rapid changes in direction; conversely, cumulative 
multiplication of the median link relatives possibly decreases the error. 
If so, the link-relative method is not as unreliable as Table III would 
seem to show it. 


The penalty which the computer pays for accuracy and reliability
is, of course, a longer time of computation. The time required for
the application of each method to the given series is shown in Table
IV. The time allowed is for each operation to be performed twice, and
the results checked against each other. In addition to the five methods
used throughout this article, the time is given for a short cut to the
best method, involving much more of the personal equation. The short
cut consists in not computing the moving mean at all, but drawing the
trend-cycle curve altogether free-hand.


TABLE IV


COMPUTING TIME OF SEASONALS


Method                           Time in minutes

Monthly means                     60
Interpolation                    110
Link relative                    160
Ratio to trend-cycle:
    Moving mean                  285
    With free-hand correction    495
    All free-hand curve          371


This timing, of course, assumes that the seasonal index is the
whole object of the computation. If we were finding the seasonal only
as a part of a general statistical analysis of the series, the seasonal
should not be charged with the full time for the steps which are
useful for other parts of the analysis. The trend-cycle curve, for instance,
has other uses than in computing seasonals. In considering the
question of speed, it should also be recognized that the various methods do
not have the same relative time for series of different length, nor
when more elaborate calculating equipment is available than we used.


High speed of computation is not as necessary in seasonals as it 
is, shall we say, in index numbers. The calculation of a seasonal is 
a task that does not have to be repeated often for any one series. It 
is more in the nature of a capital expenditure than a current routine. 
For student theses, and for investigations where the computation of 
seasonals is merely incidental or can be roughly done, the monthly- 
means method is adequate. But for a positive study of actual sea- 
sonal influences, and for the elimination of the seasonal factor from 
indexes published currently by research bureaux, the best method should
be used regardless of the longer time needed. 

































MODIFICATIONS OF THE LINK RELATIVE AND 
INTERPOLATION METHODS OF DETER- 
MINING SEASONAL VARIATION 


By 


RICHARD A. ROBB





In a recent paper¹ the statistical department of the Detroit Edison
Company have introduced a new method of calculating seasonal
variation in a time series. Briefly, the time series $u_x$ is represented by the
function

$$u_x = f(x)\, c(x)\, s(x) + \epsilon_x$$

where $f(x)$ represents secular trend, $c(x)$ cycle, $s(x)$ seasonal, and
$\epsilon_x$ residual errors, and by the Method of Least Squares the seasonal
variation for any one month will be given by

$$s(i) = \frac{\sum u_x\, f(x)\, c(x)}{\sum \left[ f(x)\, c(x) \right]^{2}}, \qquad i = 1, 2, 3, \ldots, 12 \tag{A}$$

where $s(i)$ represents the seasonal variation in the $i$th month and
the summations in the right hand member of the equation are taken
over the years covered by the time series.


If the Method of Moments be used

$$s(i) = \frac{\sum u_x}{\sum \left[ f(x)\, c(x) \right]} \tag{B}$$

The trouble lies in the determination of the denominator
$\sum [f(x)\, c(x)]^{2}$ or $\sum [f(x)\, c(x)]$. The Detroit Edison
have overcome this difficulty by smoothing the observed time series
with a sixth degree parabola, keeping the total population for each
year unchanged over a period of seven years. In this way seasonal
variation is obtained from (A) or (B), (B) being much easier to
handle than (A).


1. A Mathematical Theory of Seasonals, Annals of Math. Stat., 1, p. 57.
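
As a reading aid, formulas (A) and (B) can be sketched in Python for a single calendar month. The function names are hypothetical, `trend_cycle` stands for the values of f(x)c(x) in that month of each year, and the numbers are invented; the sketch assumes those trend-cycle values are already known, which is exactly the difficulty the text goes on to discuss.

```python
def seasonal_least_squares(u, trend_cycle):
    """Formula (A): sum(u * fc) / sum(fc ** 2), summed over the years."""
    numerator = sum(ui * pi for ui, pi in zip(u, trend_cycle))
    denominator = sum(pi * pi for pi in trend_cycle)
    return numerator / denominator


def seasonal_moments(u, trend_cycle):
    """Formula (B): sum(u) / sum(fc), summed over the years."""
    return sum(u) / sum(trend_cycle)


# Invented data: three Julys, trend-cycle rising, true seasonal 0.9, no residual.
trend_cycle = [100.0, 110.0, 120.0]
observed = [0.9 * p for p in trend_cycle]

s_a = seasonal_least_squares(observed, trend_cycle)
s_b = seasonal_moments(observed, trend_cycle)
```

When the residuals vanish, both formulas return the same seasonal; they differ only in how residual errors are weighted, (A) giving more weight to years in which the trend-cycle value is large.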


There appears to be an objection in fitting a curve over a period 
of seven years and thus for successive seven year intervals obtaining 
smoothed values for a time series of any given length. The ordin- 
ates of the smoothed curve are not equally weighted as, for example, 
in fitting curves over a ten-year period, the first smoothed ordinate 
for a year is given by one curve, the second by two curves, the third 
by three curves, the fourth, fifth, sixth and seventh by four curves, 
the eighth by three, the ninth by two and the tenth by one. To over- 
come this I decided that my smoothed curve should have the same 
zero, first and second moments as the observed curve over a period of 
twelve months. This simply means that a parabola of 2nd degree was 
fitted to the successive twelve month intervals, and as above a smoothed 
curve will be obtained for any length of time. 
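
The unequal weighting described above is easy to verify by counting, for each year of a ten-year series, how many seven-year curves cover it. The function name in this small Python check is hypothetical.

```python
def curves_per_year(n_years=10, window=7):
    """Number of successive `window`-year curves that cover each year."""
    starts = range(n_years - window + 1)
    return [sum(1 for s in starts if s <= year < s + window)
            for year in range(n_years)]


counts = curves_per_year()
```

The result is 1, 2, 3, 4, 4, 4, 4, 3, 2, 1, in agreement with the enumeration in the text.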


If the observed values $u_x$ are plotted against the corresponding
values of $x$ and a parabola of second degree fitted to the points

$$u_{-m},\ u_{-m+1},\ \ldots,\ u_{0},\ \ldots,\ u_{m-1},\ u_{m},$$

determining the constants by the Method of Least Squares, the ordinate
of the curve at $x = 0$ is taken as the graduated value of $u_0$. If
$m = 6$ this would involve thirteen observed ordinates, whereas I desire
twelve. This difficulty, however, is easily removed by finding a first
approximation to my graduated value by using thirteen ordinates;
having found the corresponding seasonal variation by (A) or (B), the
thirteenth ordinate is divided by this seasonal factor. The parabola
which is to represent the smoothed curve given by trend × cycle is
then found from the twelve ordinates subject to seasonal, trend and
cycle influences, and a thirteenth from which seasonal has been
eliminated.


The graduated ordinate at $x = 0$ corresponding to $u_0$ is (first
approximation)

$$\frac{1}{143}\Big[\, 25\,u_0 + 24\,(u_1 + u_{-1}) + 21\,(u_2 + u_{-2}) + 16\,(u_3 + u_{-3}) + 9\,(u_4 + u_{-4}) - 11\,(u_6 + u_{-6}) \,\Big] \tag{C}$$













For example, if we take thirteen ordinates, commencing at January,
1904, and finishing at January, 1905, the first approximation for
July is

$$\text{July} = \frac{1}{143}\Big[ -11\,(\text{Jan.})_{1904} + 9\,(\text{Mar.}) + 16\,(\text{Apr.}) + 21\,(\text{May}) + 24\,(\text{June}) + 25\,(\text{July}) + 24\,(\text{Aug.}) + 21\,(\text{Sept.}) + 16\,(\text{Oct.}) + 9\,(\text{Nov.}) - 11\,(\text{Jan.})_{1905} \Big]$$





where (1) I have designated the production for any one month by the 
corresponding name of the month, and (2) the formula is rearranged 
in a form suitable for the calculating machine. 
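
The thirteen-ordinate formula lends itself to a short Python sketch. The weights are read off the July example above (February and December carry weight zero); the function name and the test parabola are hypothetical. Because the weights come from a least-squares parabola, a series that is itself a second-degree parabola should be returned unchanged at the central point, which the sketch checks.

```python
# Weights for u_-6 ... u_0 ... u_6; they sum to 143.
WEIGHTS = [-11, 0, 9, 16, 21, 24, 25, 24, 21, 16, 9, 0, -11]


def graduate(thirteen_ordinates):
    """Graduated central value from thirteen consecutive monthly ordinates."""
    if len(thirteen_ordinates) != 13:
        raise ValueError("exactly thirteen ordinates are required")
    return sum(w * u for w, u in zip(WEIGHTS, thirteen_ordinates)) / 143


# Invented check: ordinates lying exactly on the parabola 3 + 2k + 0.5k^2.
parabola = [3 + 2 * k + 0.5 * k * k for k in range(-6, 7)]
center = graduate(parabola)   # the parabola's own value at k = 0
```

The weights also sum to zero against k and against k squared, which is why the linear and quadratic parts of the test parabola drop out.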


If formula (B) is used it is readily seen that to obtain the
seasonal variation for any month we must:

(1) Sum together all the Januaries, then all the Februaries, etc.
It should be noted that, as the first six months and the last six months
of a time series are not weighted equally with the others, no graduated
points were found for these periods. In consequence, as will be seen
in practice, two sets of summations of the different months are
required, the first including every year except the last, and the second
excluding the first year. Then apply formula (C). This gives
$\sum [f(x)\, c(x)]$.

(2) Divide $\sum u_x$ by $\sum [f(x)\, c(x)]$.










Having obtained a first approximation to the seasonal factors,
$\sum [f(x)\, c(x)]$ is recomputed as explained above. In practice
this is quickly executed, as will be seen in an example completely
worked out below.


To illustrate this method I have taken the theoretical time series 
given by the Detroit Edison. Summing the productions for the various 
months, we have Table I. 


To find the seasonal for July, for example, we have to find the
value of

$$\sum [f(x)\, c(x)] = \frac{1}{143}\Big[ -11(20434) + 9(21621) + 16(22615) + 21(23035) + 24(21129) + 25(21508) + 24(22118) + 21(22212) + 16(23186) + 9(21215) - 11(21215) \Big]$$

Using formula (B) the seasonal for July is 0.965.


TABLE I


Month        1904-1914   1905-1915   1st approx.   2nd approx.

January        20,434      21,215       .971          .973
February       19,425      20,143       .918          .919
March          21,621      22,389      1.015         1.014
April          22,615      23,196      1.045         1.039
May            23,035      24,231      1.061         1.062
June           21,129      22,567       .974          .974
July           21,508      22,820       .965          .967
August         22,118      23,077       .987          .993
September      22,212      23,707      1.011         1.010
October        23,186      24,964      1.071         1.067
November       21,215      22,712       .987          .982
December       21,836      23,182      1.002         1.005

Total                                 1200.07       1200.05


In this way the seasonals in Column 4 of Table I were obtained. 


The second approximation is obtained with little extra trouble;
for July, on account of the thirteenth ordinate, in this case the
January of the following year, which has a seasonal of .971, we have to
replace the last term in $143 \sum [f(x)\, c(x)]$ given above by
$-11(21215)/0.971$, i. e., the recomputed $\sum [f(x)\, c(x)]$ is now

$$\frac{1}{143}\Big[\, 3186016 - 11(21215)(0.0299) \,\Big],$$

the reciprocal of 0.971 being 1.0299.


The seasonals obtained with these corrections are given in Col- 
umn 5 of Table I. 
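
The July computation can be retraced in a few lines of Python, using the monthly totals of Table I. The variable names are hypothetical; the steps follow the text (formula (C) applied to the totals, then formula (B), then the thirteenth ordinate divided by its first-approximation seasonal of .971).

```python
WEIGHTS = [-11, 0, 9, 16, 21, 24, 25, 24, 21, 16, 9, 0, -11]

# January-December totals for 1904-1914, followed by the January total
# for 1905-1915, which serves as the thirteenth ordinate.
totals = [20434, 19425, 21621, 22615, 23035, 21129, 21508,
          22118, 22212, 23186, 21215, 21836, 21215]
july_total = totals[6]

# First approximation: 143 * sum of f(x)c(x), then formula (B).
weighted = sum(w * t for w, t in zip(WEIGHTS, totals))
first = july_total / (weighted / 143)

# Second approximation: the thirteenth ordinate is divided by .971, so its
# term changes by -11 * 21215 * (1/.971 - 1).
corrected = weighted - 11 * totals[12] * (1 / 0.971 - 1)
second = july_total / (corrected / 143)
```

The weighted sum comes to 3186016, and the two seasonals round to .965 and .967, the July entries of Table I.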


Comparing the seasonals with actual values, we have the
following table.


                      Feb.    March   April   May

Actual Seasonal        .930   1.050   1.020   1.040
Computed Seasonal      .919   1.014   1.039   1.062

Error                 -.011   -.036   +.019   +.022


                      Aug.    Sept.   Oct.    Nov.    Dec.

Actual Seasonal       1.000    .980   1.040    .990   1.000
Computed Seasonal      .993   1.010   1.067    .982   1.005

Error                 -.007   +.030   +.027   -.008   +.005


The mean and standard deviations are compared with the Inter- 
polation Method of the Detroit Edison. 


                          Mean Deviation    Standard Deviation
                          of Errors         of Errors

New Method                   .0168             .0194
Interpolation Method         02                .0337


It will be noticed that the new method of smoothing yields a 
standard deviation which is roughly a little greater than half that ob- 
tained by the Interpolation Method. 


To test whether any actual difference in Seasonals would be
obtained by using formula (A), the ordinates of the smoothed curve
were found by formula (C) and are given in Table II. The seasonals
were as follows:


Jan.    Feb.    March   April   May     June
.967    .988    1.009   1.072   .986    1.005


July    Aug.    Sept.   Oct.    Nov.    Dec.    Total


figures practically identical with those previously obtained. 


In Figure I, I have plotted against the various months (1) the
actual Seasonal Indices, (2) those given by the Detroit Edison
Interpolation Method, and (3) those given by the method of this paper.


TABLE II


1905 1906

January      1724
February     1765  1968
March        1824
April        1855  1956
May          1885
June         1854  1954
July         1880  2019
August       1882  2063
September    1859  2078
October      1867  2099
November     1890  2122
December     1939  2139


1911 1912

January      1919  2157  2521
February     1872  2222  2562
March        1861  2293  2590
April        1852  2370  2592
May          1869  2441  2609
June         1900  2514  2645
July         1935  2540  2640
August       1972  2554  2624
September    2167  1993  2536  2511
October      2036  2008  2488  2405
November     1950  2048  2477  2314
December     1931  2113  2464  2271


As it will be interesting to note how the smoothed values of the 
ordinates of the time series agree with the actual, I have given below 
the Mean Deviation of errors from actual for the various months.


Month .. Jan. Feb. March April May June 
. a 41 50 43 32 3 


Month . July Aug. Sept. Oct. Nov. Dec. 
Oe .. . +5, ie 24 23 37 42 67 







For the whole period the Mean Deviation is 39.5.


The Deviations are small, the January Mean Deviation being 
roughly 3 per cent of the mean production for January; for December 
it is 3.4 per cent. 


On the assumption that the Seasonal Index for any one month
is constant for a given time series, it will be seen that Least Squares
can be used in several ways to yield Seasonals. I give one example
of its use, obtaining Seasonals by a method closely allied to the Link
Relative method.


In the Link Relative method link relatives are formed for all the 
different months. This involves the greater part of the calculation, 
and it seemed feasible that instead of calculating link relatives and 
finding median values one could assume that the production for any 
one month with reference to that for the previous month is given by 


$$\begin{aligned}
\text{February} &= a_{1}\ \text{January} \\
\text{March} &= a_{2}\ \text{February} \\
&\ \ \vdots \\
\text{December} &= a_{11}\ \text{November} \\
\text{January} &= a_{12}\ \text{December}
\end{aligned}$$

where, as before, the name of the month stands for the production for
that month, and $a_1, a_2, \ldots, a_{12}$ are constants which can be
determined by the Method of Least Squares. For February = $a_1$
January, we have

$$a_1 = \frac{\sum (\text{February})(\text{January})}{\sum (\text{January})^{2}},$$

the summations extending over the years of the time series.


Considering our time series to be $u_1, u_2, \ldots, u_{13}, u_{14}, \ldots$,

$$a_1 = \frac{u_1 u_2 + u_{13} u_{14} + u_{25} u_{26} + \cdots}{u_1^{2} + u_{13}^{2} + u_{25}^{2} + \cdots},$$

i. e., the observed productions for two successive months are multiplied
together and summed and the whole divided by the sum of the squares
of the production of the first of the months.


FIGURE I


Solid line: Actual Seasonals.
Broken line: New Interpolation Method.
Detroit Edison Interpolation.
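
The least-squares coefficients can be sketched in Python as follows. The function name is hypothetical and the series is invented; the sketch assumes a monthly series that begins with a January, and it forms each coefficient as the sum of products of successive months divided by the sum of squares of the first month of each pair.

```python
def link_coefficients(series):
    """Least-squares a_1 ... a_12 for a monthly series starting in January.

    a_k = sum(u_m * u_{m+1}) / sum(u_m ** 2), the sums running over every
    pair of successive items whose first member falls in calendar month k.
    """
    products = [0.0] * 12
    squares = [0.0] * 12
    for i in range(len(series) - 1):
        month = i % 12
        products[month] += series[i] * series[i + 1]
        squares[month] += series[i] ** 2
    return [p / s for p, s in zip(products, squares)]


# Invented series: three years of a level series with a fixed seasonal.
pattern = [0.95, 0.90, 1.00, 1.05, 1.10, 1.00, 0.95, 1.00, 1.05, 1.10, 0.95, 0.95]
series = [200 * s for _ in range(3) for s in pattern]
a = link_coefficients(series)
```

For the invented series, which has no trend, each coefficient is simply the ratio of one month's seasonal to the preceding month's, so the twelve coefficients multiply to unity.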


















These coefficients $a_1, a_2, \ldots$ correspond to the median
link relatives, and the procedure is then similar to that used in that
method, i. e., January is assumed to be 100.0, etc. We thus get the
following table.


TABLE III


               (1)       (2)        (3)        (4)        (5)
                        Chain       (2)      Seasonal    Error from Actual
Month          a_i     Relative   Adjusted   Indices     (divided by 100)

January        .951     100.0      100.0       95.5        -.035
February      1.106      95.1       94.8       90.5        -.025
March         1.043     105.2      104.3       99.6        -.054
April         1.037     109.6      108.3      103.4        +.014
May            .934     113.7      111.9      106.8        +.028
June           .997     106.2      104.1       99.4        +.014
July          1.018     106.0      103.5       98.8        +.008
August        1.019     107.9      105.0      100.3        +.003
September     1.056     110.0      106.6      101.8        +.038
October        .915     116.2      112.2      107.1        +.031
November      1.020     106.3      102.2       97.6        -.014
December       .967     108.4      103.9       99.2        -.008
January                 104.8      100.0












The Standard Deviation of the Errors of column (5) is found
to be ± 0.0269, which is considerably less than that of the Link Relative
Method.
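
The passage from the coefficients to the seasonal indices of Table III can be retraced with a short Python sketch. It takes the coefficients of column (1) as printed; the logarithmic correction is assumed here to be one constant factor per month, chosen so that the closing January returns to 100, and the indices are assumed to be the adjusted values rescaled to average 100. The standard deviation is taken as the root mean square of the printed errors, which is also an assumption.

```python
from math import sqrt

# Column (1) of Table III: a_1 ... a_12.
a = [0.951, 1.106, 1.043, 1.037, 0.934, 0.997,
     1.018, 1.019, 1.056, 0.915, 1.020, 0.967]

# Chain relatives: January = 100, each later month = previous month * a_k.
chain = [100.0]
for coeff in a:
    chain.append(chain[-1] * coeff)      # chain[12] is the closing January

# Logarithmic correction: a constant monthly factor bringing chain[12] to 100.
factor = (100.0 / chain[12]) ** (1 / 12)
adjusted = [c * factor ** k for k, c in enumerate(chain[:12])]

# Seasonal indices: adjusted values rescaled so that they average 100.
seasonal = [100 * v * 12 / sum(adjusted) for v in adjusted]

# Root mean square of the errors printed in column (5).
errors = [-.035, -.025, -.054, .014, .028, .014,
          .008, .003, .038, .031, -.014, -.008]
rms = sqrt(sum(e * e for e in errors) / len(errors))
```

Starting from the rounded coefficients, the closing January comes out near 104.7 rather than the printed 104.8, the monthly factor near .996, the January index near 95.5, and the root mean square of the errors near .0269.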


If we assume that $u_x$ can be represented by the points on a
theoretical curve $u_x = s(x)\, f(x)\, c(x) + \epsilon_x$ as given by
the Detroit Edison Statistical Department, it will be seen that
February = $a_1$ (January) gives

$$a_1 = \frac{s(2)}{s(1)} \cdot \frac{\sum [f(x)\, c(x)]\,[f(x+1)\, c(x+1)]}{\sum [f(x)\, c(x)]^{2}},$$

where, if $x = 1$ corresponds to the first January of the time series,
$x = 1, 13, 25$, etc.


$s(2)/s(1)$ can therefore be found as soon as a value can be obtained
for the adjustment factor $\sum [\psi(x)]^{2} / \sum \psi(x)\, \psi(x+1)$,
where $\psi(x) = f(x)\, c(x)$. If the time series is smoothed by the
method already discussed, satisfactory values of $\psi(x)$ are obtained
and the adjustment factors easily computed. The smoothed values of the
ordinates of the theoretical time series are given in Table II. As
logarithmic correction, which has already been employed, assumes a
constant adjustment factor for any pair of consecutive months, it will
be interesting to find whether the assumption of a theoretical curve
for the time series yields better adjustment factors than the constant
one used in logarithmic correction.
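
A minimal Python sketch of the adjustment factor follows. It assumes the factor is the quantity by which a least-squares coefficient must be multiplied to give the ratio of successive seasonals, that is, the sum of squares of the smoothed ordinates of the first month divided by the sum of their products with the following month; this reading is inferred from Table IV, where 95.1 × .994 is about 94.5. The function name and the smoothed ordinates are invented.

```python
def adjustment_factor(psi_first, psi_second):
    """sum(psi(x) ** 2) / sum(psi(x) * psi(x + 1)), summed over the years.

    psi_first holds the smoothed ordinates of one calendar month in each
    year, psi_second those of the month that follows it.
    """
    squares = sum(p * p for p in psi_first)
    products = sum(p * q for p, q in zip(psi_first, psi_second))
    return squares / products


# Invented smoothed ordinates rising half of one per cent from month to month.
january = [1000.0, 1100.0, 1200.0]
february = [p * 1.005 for p in january]
factor = adjustment_factor(january, february)

# With a least-squares coefficient a_1, the ratio s(2)/s(1) is a_1 * factor.
a_1 = 0.951                      # first coefficient of Table III
ratio = a_1 * factor
```

With a steadily rising trend-cycle the factor falls a little below unity, which is the order of magnitude shown in column (1) of Table IV.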


TABLE IV


               Adjustment   Chain Relative   (2) adjusted   Seasonal
Month             (1)            (2)             (3)          (4)

January                        100.0           100.0         97.2
February         .994           94.5            94.4         91.8
March            .993          103.8           103.6        100.7
April            .991          107.3           107.1        104.1
May              .980          109.0           108.7        105.7
June             .984          100.2            98.8         96.0
July             .996           99.5            99.0         96.2
August          1.000          101.3           100.8         98.0
September                      104.7           104.1        101.2
October                        112.3           111.6        108.5
November                       103.3           102.5         99.6
December                       104.8           104.0        101.1
January                        100.9           100.0


The adjustment factors are given in column (1) of Table IV, and
the corresponding seasonal indices in column (4). The standard
deviation of errors, ± .0248, is less than that obtained with the adjustment
as used in the link relative, but in this particular case the adjustment
factor, using logarithmic correction, would be .996 for each month,
differing little from the factors using smoothed ordinates. It will be
noted that, owing to accidental errors, the chain relative for January
is 100.9, not 100, and an arithmetical correction has to be applied.
From this one sees that, taking accidental errors into account,
logarithmic correction is well adapted for reduction purposes.


Finally, I found the Seasonal Indices by the Variate Difference 
Method. In this method the trend is removed and second differences 
taken, which are treated by Fourier Analysis. For the second differ- 
ences I obtained 


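$$\begin{aligned}
\Delta^{2} u = {} & +0.02 + 0.822 \cos(\theta - 339^{\circ}52') + 3.958 \cos(2\theta - 314^{\circ}53') \\
& + 3.902 \cos(3\theta - 39^{\circ}34') + 4.374 \cos(4\theta - 25^{\circ}58') \\
& + 9.942 \cos(5\theta - 293^{\circ}30') + 0.775 \cos 6\theta,
\end{aligned}$$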


yielding the seasonal indices: 








Jan. 96.9 May 105.6 Sept. 101.9 
Feb. 89.2 June 99.4 Oct. 106.3 
Mar. 100.8 July 98.3 Nov. 97.8 
Apr. 103.4 Aug. 102.2 Dec. 98.3 















Dividing the seasonals by 100 and comparing with Actual values,
Mean and Standard Deviation of errors from the actual seasonals are
given by ± 0.022 and ± 0.0246, which are still considerably less than
those obtained by Link Relative or Detroit Edison Interpolation. For
reference I have put in Table V the results obtained from all the
methods mentioned in this paper, together with their Standard Deviation
(S. D.) of errors.


TABLE V


SEASONAL INDICES


                       Interpolation                   Modified Link Relative
            Actual     Detroit              Link       Log.         Theoretical    Variate
            Values     Edison     Robb      Relative   Correction   Correction     Difference

January
February
March
April
May
June
July
August
September
October
November
December

S. D.                             .0194     .0338                   .0248


As the time taken to determine the Seasonal Indices by the various 
methods is important, I took the time series of Merchandise Imports 
for a period of ten years, and calculated the Seasonal Indices. Denot- 
ing the Link Relative method by &, , Modified Link Relative (Log. 

_ correction) by &,, Interpolation (as given in this paper) by I, I 
found that as regards the time taken for one determination completely 


checked #, * &.T 2115827 


For the particular example taken, it took 1.75 hours to transcribe 
the material and to determine the Indices, completely checked, by the 
Interpolation method. For the Link Relative method, 7.75 hours were 
taken. It is desirable, if possible, to have two independent determina- 
tions, and the above times would consequently have to be doubled. The 
Variate Difference method roughly takes the same time as the Link 
Relative method, when the trend has been removed from the time series. 


Richard A. Robb.


Glasgow, Scotland. 





